Learning PySpark

Table of Contents

Learning PySpark
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface: What this book covers; What you need for this book; Who this book is for; Conventions; Reader feedback; Customer support; Downloading the example code; Downloading the color images of this book; Errata; Piracy; Questions
1. Understanding Spark: What is Apache Spark?; Spark Jobs and APIs; Execution process; Resilient Distributed Dataset; DataFrames; Datasets; Catalyst Optimizer; Project Tungsten; Spark 2.0 architecture; Unifying Datasets and DataFrames; Introducing SparkSession; Tungsten phase 2; Structured Streaming; Continuous applications; Summary
2. Resilient Distributed Datasets: Internal workings of an RDD; Creating RDDs; Schema; Reading from files; Lambda expressions; Global versus local scope; Transformations; The .map(...) transformation; The .filter(...) transformation; The .flatMap(...) transformation; The .distinct(...) transformation; The .sample(...) transformation; The .leftOuterJoin(...) transformation; The .repartition(...) transformation; Actions; The .take(...) method; The .collect(...) method; The .reduce(...) method; The .count(...) method; The .saveAsTextFile(...) method; The .foreach(...) method; Summary
3. DataFrames: Python to RDD communications; Catalyst Optimizer refresh; Speeding up PySpark with DataFrames; Creating DataFrames; Generating our own JSON data; Creating a DataFrame; Creating a temporary table; Simple DataFrame queries; DataFrame API query; SQL query; Interoperating with RDDs; Inferring the schema using reflection; Programmatically specifying the schema; Querying with the DataFrame API; Number of rows; Running filter statements; Querying with SQL; Number of rows; Running filter statements using the where clauses; DataFrame scenario – on-time flight performance; Preparing the source datasets; Joining flight performance and airports; Visualizing our flight-performance data; Spark Dataset API; Summary
4. Prepare Data for Modeling: Checking for duplicates, missing observations, and outliers; Duplicates; Missing observations; Outliers; Getting familiar with your data; Descriptive statistics; Correlations; Visualization; Histograms; Interactions between features; Summary
5. Introducing MLlib: Overview of the package; Loading and transforming the data; Getting to know your data; Descriptive statistics; Correlations; Statistical testing; Creating the final dataset; Creating an RDD of LabeledPoints; Splitting into training and testing; Predicting infant survival; Logistic regression in MLlib; Selecting only the most predictable features; Random forest in MLlib; Summary
6. Introducing the ML Package: Overview of the package; Transformer; Estimators; Classification; Regression; Clustering; Pipeline; Predicting the chances of infant survival with ML; Loading the data; Creating transformers; Creating an estimator; Creating a pipeline; Fitting the model; Evaluating the performance of the model; Saving the model; Parameter hyper-tuning; Grid search; Train-validation splitting; Other features of PySpark ML in action; Feature extraction; NLP-related feature extractors; Discretizing continuous variables; Standardizing continuous variables; Classification; Clustering; Finding clusters in the births dataset; Topic mining; Regression; Summary
7. GraphFrames: Introducing GraphFrames; Installing GraphFrames; Creating a library; Preparing your flights dataset; Building the graph; Executing simple queries; Determining the number of airports and trips; Determining the longest delay in this dataset; Determining the number of delayed versus on-time/early flights; What flights departing Seattle are most likely to have significant delays?; What states tend to have significant delays departing from Seattle?; Understanding vertex degrees; Determining the top transfer airports; Understanding motifs; Determining airport ranking using PageRank; Determining the most popular non-stop flights; Using Breadth-First Search; Visualizing flights using D3; Summary
8. TensorFrames: What is Deep Learning?; The need for neural networks and Deep Learning; What is feature engineering?; Bridging the data and algorithm; What is TensorFlow?; Installing Pip; Installing TensorFlow; Matrix multiplication using constants; Matrix multiplication using placeholders; Running the model; Running another model; Discussion; Introducing TensorFrames; TensorFrames – quick start; Configuration and setup; Launching a Spark cluster; Creating a TensorFrames library; Installing TensorFlow on your cluster; Using TensorFlow to add a constant to an existing column; Executing the Tensor graph; Blockwise reducing operations example; Building a DataFrame of vectors; Analysing the DataFrame; Computing elementwise sum and min of all vectors; Summary
9. Polyglot Persistence with Blaze: Installing Blaze; Polyglot persistence; Abstracting data; Working with NumPy arrays; Working with pandas' DataFrame; Working with files; Working with databases; Interacting with relational databases; Interacting with the MongoDB database; Data operations; Accessing columns; Symbolic transformations; Operations on columns; Reducing data; Joins; Summary
10. Structured Streaming: What is Spark Streaming?; Why do we need Spark Streaming?; What is the Spark Streaming application data flow?; Simple streaming application using DStreams; A quick primer on global aggregations; Introducing Structured Streaming; Summary
11. Packaging Spark Applications: The spark-submit command; Command line parameters; Deploying the app programmatically; Configuring your SparkSession; Creating SparkSession; Modularizing code; Structure of the module; Calculating the distance between two points; Converting distance units; Building an egg; User defined functions in Spark; Submitting a job; Monitoring execution; Databricks Jobs; Summary
Index

Learning PySpark

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2017

Production reference: 1220217

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78646-370-8

www.packtpub.com

Credits

Authors: Tomasz Drabas, Denny Lee
Reviewer: Holden Karau
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Prachi Bisht
Content Development Editor: Amrita Noronha
Technical Editor: Akash Patel
Copy Editor: Safis Editing
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Disha Haria
Production Coordinator: Aparna Bhagat
Cover Work: Aparna Bhagat

Foreword

Thank you for choosing this book to start your PySpark adventures; I hope you are as excited as I am. When Denny Lee first told me about this new book I was delighted: one of the most important things that makes Apache Spark such a wonderful platform is supporting both the

'/Users/drabast/Documents/PySpark_Data/data_key.txt')

To read it back, you need to parse it, as all the rows are treated as strings:

def parseInput(row):
    import re
    pattern = re.compile(r'\(\'([a-z])\', ([0-9])\)')
    row_split = pattern.split(row)
    return (row_split[1], int(row_split[2]))

data_key_reread = sc \
    .textFile(
        '/Users/drabast/Documents/PySpark_Data/data_key.txt') \
    .map(parseInput)

data_key_reread.collect()

The list of keys read matches what we had initially:

The .foreach(...) method

This is a method that applies the same function to each element of the RDD in an iterative way; in contrast to .map(...), the .foreach(...) method applies a defined function to each record in a one-by-one fashion. It is useful when you want to save the data to a database that is not natively supported by PySpark.

Here, we'll use it to print (to the CLI, not the Jupyter Notebook) all the records that are stored in the data_key RDD:

def f(x):
    print(x)

data_key.foreach(f)

If you now navigate to the CLI, you should see all the records printed out. Note that the order will most likely be different every time.

Summary

RDDs are the backbone of Spark; these schema-less data structures are the most fundamental data structures that we will deal with within Spark.

In this chapter, we presented ways to create RDDs from text files, by means of the .parallelize(...) method as well as by reading data from text files. Also, some ways of processing unstructured data were shown.

Transformations in Spark are lazy - they are only applied when an action is called. In this chapter, we discussed and presented the most commonly used transformations and actions; the PySpark documentation contains many more: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.

One major distinction between Scala and Python RDDs is speed: Python RDDs can be much slower than their Scala counterparts.

In the next chapter, we will walk you through a data structure that made PySpark applications perform on par with those written in Scala - the DataFrames.

Chapter 3. DataFrames

A DataFrame is an immutable distributed collection of data that is organized into named columns, analogous to a table in a relational database. Introduced as an experimental feature within Apache Spark 1.0 as SchemaRDD, they were renamed to DataFrames as part of the Apache Spark 1.3 release.
For readers who are familiar with Python pandas DataFrames or R DataFrames, a Spark DataFrame is a similar concept in that it allows users to easily work with structured data (for example, data tables); there are some differences as well, so please temper your expectations.

By imposing a structure onto a distributed collection of data, this allows Spark users to query structured data in Spark SQL or using expression methods (instead of lambdas). In this chapter, we will include code samples using both methods. By structuring your data, this allows the Apache Spark engine – specifically, the Catalyst Optimizer – to significantly improve the performance of Spark queries. In earlier APIs of Spark (that is, RDDs), executing queries in Python could be significantly slower due to the communication overhead between the JVM and Py4J.

Note

If you are familiar with working with DataFrames in previous versions of Spark (that is, Spark 1.x), you will notice that in Spark 2.0 we are using SparkSession instead of SQLContext. The various Spark contexts: HiveContext, SQLContext, StreamingContext, and SparkContext have merged together in SparkSession. This way, you will be working with this session only as an entry point for reading data, working with metadata, configuration, and cluster resource management.

For more information, please refer to How to use SparkSession in Apache Spark 2.0 (http://bit.ly/2br0Fr1).

In this chapter, you will learn about the following:

- Python to RDD communications
- A quick refresh of Spark's Catalyst Optimizer
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Simple DataFrame queries
- Interoperating with RDDs
- Querying with the DataFrame API
- Querying with Spark SQL
- Using DataFrames for on-time flight performance

Python to RDD communications

Whenever a PySpark program is executed using RDDs, there is a potentially large overhead to execute the job. As noted in the following diagram, in the PySpark driver, the Spark Context uses Py4j to launch a JVM using the JavaSparkContext. Any RDD transformations are initially mapped to PythonRDD objects in Java.

Once these tasks are pushed out to the Spark Worker(s), PythonRDD objects launch Python subprocesses using pipes to send both code and data to be processed within Python:

While this approach allows PySpark to distribute the processing of the data to multiple Python subprocesses on multiple workers, as you can see, there is a lot of context switching and communication overhead between Python and the JVM.

Note

An excellent resource on PySpark performance is Holden Karau's Improving PySpark Performance: Spark performance beyond the JVM: http://bit.ly/2bx89bn.

Catalyst Optimizer refresh

As noted in Chapter 1, Understanding Spark, one of the primary reasons the Spark SQL engine is so fast is the Catalyst Optimizer. For readers with a database background, this diagram looks similar to the logical/physical planner and cost model/cost-based optimization of a relational database management system (RDBMS):

The significance of this is that, as opposed to immediately processing the query, the Spark engine's Catalyst Optimizer compiles and optimizes a logical plan and has a cost optimizer that determines the most efficient physical plan to generate.

Note

As noted in earlier chapters, the Spark SQL engine has both rule-based and cost-based optimizations, which include (but are not limited to) predicate pushdown and column pruning.
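To see these optimizations at work, you can ask Spark to print the plans that the Catalyst Optimizer produces for a query. The following is a minimal sketch (the DataFrame and the Parquet path are hypothetical, not from the book); pushed filters and pruned columns show up in the physical plan:

# assumes `spark` is an active SparkSession and that some Parquet data
# exists at the (hypothetical) path below
df = spark.read.parquet('/tmp/swimmers.parquet')

query = df.filter(df.age == 22).select('id', 'age')

# prints the parsed, analyzed, and optimized logical plans plus the physical plan;
# look for the pushed filters and the pruned column list in the physical plan
query.explain(True)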
Targeted for the Apache Spark 2.2 release, the jira item [SPARK-16026] Cost-based Optimizer Framework at https://issues.apache.org/jira/browse/SPARK-16026 is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. For more information, please refer to the Design Specification of Spark Cost-Based Optimization at http://bit.ly/2li1t4T.

As part of Project Tungsten, there are further improvements to performance by generating byte code (code generation, or codegen) instead of interpreting each row of data. Find more details on Tungsten in the Project Tungsten section in Chapter 1, Understanding Spark.

As previously noted, the optimizer is based on functional programming constructs and was designed with two purposes in mind: to ease the adding of new optimization techniques and features to Spark SQL, and to allow external developers to extend the optimizer (for example, adding data-source-specific rules, support for new data types, and so on).

Note

For more information, please refer to Michael Armbrust's excellent presentation, Structuring Spark: SQL DataFrames, Datasets, and Streaming at http://bit.ly/2cJ508x.

For further understanding of the Catalyst Optimizer, please refer to Deep Dive into Spark SQL's Catalyst Optimizer at http://bit.ly/2bDVB1T.

Also, for more information on Project Tungsten, please refer to Project Tungsten: Bringing Apache Spark Closer to Bare Metal at http://bit.ly/2bQIlKY, and Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop at http://bit.ly/2bDWtnc.

Speeding up PySpark with DataFrames

The significance of DataFrames and the Catalyst Optimizer (and Project Tungsten) is the increase in performance of PySpark queries when compared to non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python query speeds were often twice as slow as the same Scala queries using RDDs. Typically, this slowdown in query performance was due to the communications overhead between Python and the JVM:

Source: Introducing DataFrames in Apache Spark for Large Scale Data Science at http://bit.ly/2blDBI1

With DataFrames, not only was there a significant improvement in Python performance, there is now performance parity between Python, Scala, SQL, and R.

Tip

It is important to note that, while PySpark with DataFrames is often significantly faster, there are some exceptions. The most prominent one is the use of Python UDFs, which results in round-trip communication between Python and the JVM.
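To make this exception concrete, here is a hedged sketch (the DataFrame and column names are illustrative, not from the book) contrasting a Python UDF with the equivalent built-in function; the UDF ships every row to a Python worker and back, whereas the built-in expression is evaluated entirely inside the JVM by Catalyst:

from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

df = spark.createDataFrame([(1, 'Katie'), (2, 'Michael')], ['id', 'name'])

# Python UDF: incurs JVM <-> Python serialization for every row
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn('name_upper', to_upper(df.name)).show()

# built-in function: no round trip to Python, so Catalyst/Tungsten can optimize it
df.withColumn('name_upper', upper(df.name)).show()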
Note, this would be the worst-casescenario which would be similar if the compute was done on RDDs.Python can take advantage of the performance optimizations in Sparkeven while the codebase for the Catalyst Optimizer is written in Scala.Basically, it is a Python wrapper of approximately 2,000 lines of codethat allows PySpark DataFrame queries to be significantly faster.Altogether, Python DataFrames (as well as SQL, Scala DataFrames, andR DataFrames) are all able to make use of the Catalyst Optimizer (as perthe following updated diagram):NoteFor more information, please refer to the blog post IntroducingDataFrames in Apache Spark for Large Scale Data Science athttp://bit.ly/2blDBI1, as well as Reynold Xin's Spark Summit 2015presentation, From DataFrames to Tungsten: A Peek into Spark'sFuture at http://bit.ly/2bQN92T.https://www.iteblog.comhttp://bit.ly/2blDBI1http://bit.ly/2bQN92TCreating DataFramesTypically, you will create DataFrames by importing data usingSparkSession (or calling spark in the PySpark shell).TipIn Spark 1.x versions, you typically had to use sqlContext.In future chapters, we will discuss how to import data into your local filesystem, Hadoop Distributed File System (HDFS), or other cloudstorage systems (for example, S3 or WASB). For this chapter, we willfocus on generating your own DataFrame data directly within Spark orutilizing the data sources already available within DatabricksCommunity Edition.NoteFor instructions on how to sign up for the Community Edition ofDatabricks, see the bonus chapter, Free Spark Cloud Offering.First, instead of accessing the file system, we will create a DataFrame bygenerating the data. In this case, we'll first create the stringJSONRDDRDD and then convert it into a DataFrame. This code snippet creates anRDD comprised of swimmers (their ID, name, age, and eye color) inJSON format.Generating our own JSON dataBelow, we will generate initially generate the stringJSONRDD RDD:stringJSONRDD = sc.parallelize((""" { "id": "123","name": "Katie","age": 19,"eyeColor": "brown" }""","""{"id": "234",https://www.iteblog.com"name": "Michael","age": 22,"eyeColor": "green" }""", """{"id": "345","name": "Simone","age": 23,"eyeColor": "blue" }"""))Now that we have created the RDD, we will convert this into aDataFrame by using the SparkSession read.json method (that is,spark.read.json(...)). We will also create a temporary table by usingthe .createOrReplaceTempView method.NoteIn Spark 1.x, this method was.registerTempTable, which is beingdeprecated as part of Spark 2.x.Creating a DataFrameHere is the code to create a DataFrame:swimmersJSON = spark.read.json(stringJSONRDD)Creating a temporary tableHere is the code for creating a temporary table:swimmersJSON.createOrReplaceTempView("swimmersJSON")As noted in the previous chapters, many RDD operations aretransformations, which are not executed until an action operation isexecuted. For example, in the preceding code snippet, thesc.parallelize is a transformation that is executed when convertingfrom an RDD to a DataFrame by using spark.read.json. 
Notice that, inthe screenshot of this code snippet notebook (near the bottom left), theSpark job is not executed until the second cell containing thehttps://www.iteblog.comspark.read.json operation.TipThese are screenshots from Databricks Community Edition, but all thecode samples and Spark UI screenshots can be executed/viewed in anyflavor of Apache Spark 2.x.To further emphasize the point, in the right pane of the following figure,we present the DAG graph of execution.NoteA great resource to better understand the Spark UI DAG visualization isthe blog post Understanding Your Apache Spark Application ThroughVisualization at http://bit.ly/2cSemkv.In the following screenshot, you can see the Spark job' sparallelizeoperation is from the first cell generating the RDD stringJSONRDD, whilethe map and mapPartitions operations are the operations required tocreate the DataFrame:https://www.iteblog.comhttp://bit.ly/2cSemkvSpark UI of the DAG visualization of thespark.read.json(stringJSONRDD) job.In the following screenshot, you can see the stages for the parallelizeoperation are from the first cell generating the RDD stringJSONRDD,while the map and mapPartitions operations are the operations requiredto create the DataFrame:https://www.iteblog.comhttps://www.iteblog.comSpark UI of the DAG visualization of the stages within thespark.read.json(stringJSONRDD) job.It is important to note that parallelize, map, and mapPartitions are allRDD transformations. Wrapped within the DataFrame operation,spark.read.json (in this case), are not only the RDD transformations,but also the action which converts the RDD into a DataFrame. This is animportant call out, because even though you are executing DataFrameoperations, to debug your operations you will need to remember thatyou will be making sense of RDD operations within the Spark UI.Note that creating the temporary table is a DataFrame transformationand not executed until a DataFrame action is executed (for example, inthe SQL query to be executed in the following section).NoteDataFrame transformations and actions are similar to RDDtransformations and actions in that there is a set of operations that arelazy (transformations). But, in comparison to RDDs, DataFramesoperations are not as lazy, primarily due to the Catalyst Optimizer. Formore information, please refer to Holden Karau and Rachel Warren'sbook High Performance Spark, http://highperformancespark.com/.https://www.iteblog.comhttp://highperformancespark.com/Simple DataFrame queriesNow that you have created the swimmersJSON DataFrame, we will beable to run the DataFrame API, as well as SQL queries against it. Let'sstart with a simple query showing all the rows within the DataFrame.DataFrame API queryTo do this using the DataFrame API, you can use the show() method,which prints the first n rows to the console:TipRunning the.show() method will default to present the first 10 rows.# DataFrame APIswimmersJSON.show()This gives the following output:SQL queryIf you prefer writing SQL statements, you can write the following query:spark.sql("select * from swimmersJSON").collect()This will give the following output:https://www.iteblog.comWe are using the .collect() method, which returns all the records as alist of Row objects. Note that you can use either the collect() orshow() method for both DataFrames and SQL queries. Just make surethat if you use .collect(), this is for a small DataFrame, since it willreturn all of the rows in the DataFrame and move them back from theexecutors to the driver. 
You can instead use take() or show(),which allow you to limit the number of rows returned by specifying :TipNote that, if you are using Databricks, you can use the %sql commandand run your SQL statement directly within a notebook cell, as noted.https://www.iteblog.comInteroperatingwith RDDsThere are two different methods for converting existing RDDs toDataFrames (or Datasets[T]): inferring the schema using reflection, orprogrammatically specifying the schema. The former allows you to writemore concise code (when your Spark application already knows theschema), while the latter allows you to construct DataFrames when thecolumns and their data types are only revealed at run time. Note,reflection is in reference to schema reflection as opposed to Pythonreflection.Inferring the schema using reflectionIn the process of building the DataFrame and running the queries, weskipped over the fact that the schema for this DataFrame wasautomatically defined. Initially, row objects are constructed by passing alist of key/value pairs as **kwargs to the row class. Then, Spark SQLconverts this RDD of row objects into a DataFrame, where the keys arethe columns and the data types are inferred by sampling the data.TipThe **kwargs construct allows you to pass a variable number ofparameters to a method at runtime.Going back to the code, after initially creating the swimmersJSONDataFrame, without specifying the schema, you will notice the schemadefinition by using the printSchema() method:# Print the schemaswimmersJSON.printSchema()This gives the following output:https://www.iteblog.comBut what if we want to specify the schema because, in this example, weknow that the id is actually a long instead of a string?Programmatically specifying the schemaIn this case, let's programmatically specify the schema by bringing inSpark SQL data types (pyspark.sql.types) and generate some .csvdata for this example:# Import typesfrom pyspark.sql.types import *# Generate comma delimited datastringCSVRDD = sc.parallelize([(123, 'Katie', 19, 'brown'), (234, 'Michael', 22, 'green'), (345, 'Simone', 23, 'blue')])First, we will encode the schema as a string, per the [schema] variablebelow. 
Then we will define the schema using StructType and StructField:

# Specify schema
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

Note, the StructField class is broken down in terms of:

- name: The name of this field
- dataType: The data type of this field
- nullable: Indicates whether values of this field can be null

Finally, we will apply the schema (schema) we created to the stringCSVRDD RDD (that is, the generated .csv data) and create a temporary view so we can query it using SQL:

# Apply the schema to the RDD and create the DataFrame
swimmers = spark.createDataFrame(stringCSVRDD, schema)

# Create a temporary view using the DataFrame
swimmers.createOrReplaceTempView("swimmers")

With this example, we have finer-grained control over the schema and can specify that id is a long (as opposed to a string in the previous section):

swimmers.printSchema()

This gives the following output:

Tip

In many cases, the schema can be inferred (as per the previous section) and you do not need to specify it explicitly, as in this preceding example.

Querying with the DataFrame API

As noted in the previous section, you can start off by using collect(), show(), or take() to view the data within your DataFrame (with the last two including the option to limit the number of returned rows).

Number of rows

To get the number of rows within your DataFrame, you can use the count() method:

swimmers.count()

This gives the following output:

Out[13]: 3

Running filter statements

To run a filter statement, you can use the filter clause; in the following code snippet, we are using the select clause to specify the columns to be returned as well:

# Get the id, age where age = 22
swimmers.select("id", "age").filter("age = 22").show()

# Another way to write the above query is below
swimmers.select(swimmers.id, swimmers.age).filter(swimmers.age == 22).show()

The output of this query shows only the id and age columns, where age = 22:

If we only want to get back the names of the swimmers who have an eye color that begins with the letter b, we can use the SQL-like like syntax, as shown in the following code:

# Get the name, eyeColor where eyeColor like 'b%'
swimmers.select("name", "eyeColor").filter("eyeColor like 'b%'").show()

The output is as follows:

Querying with SQL

Let's run the same queries, except this time, we will do so using SQL queries against the same DataFrame.
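It is worth noting that spark.sql(...) itself returns a DataFrame, so the two styles can be freely mixed; a minimal sketch, reusing the swimmers temporary view created above:

# the result of a SQL query is just another DataFrame...
sql_df = spark.sql("select id, age, eyeColor from swimmers")

# ...so DataFrame API methods chain onto it as usual
sql_df.filter(sql_df.age > 20).show()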
Recall that this DataFrame isaccessible because we executed the .createOrReplaceTempView methodfor swimmers.Number of rowsThe following is the code snippet to get the number of rows within yourDataFrame using SQL:spark.sql("select count(1) from swimmers").show()The output is as follows:Running filter statements using the whereClausesTo run a filter statement using SQL, you can use the where clause, asnoted in the following code snippet:# Get the id, age where age = 22 in SQLspark.sql("select id, age from swimmers where age = 22").show()The output of this query is to choose only the id and age columns whereage = 22:https://www.iteblog.comAs with the DataFrame API querying, if we want to get back the nameof the swimmers who have an eye color that begins with the letter bonly, we can use the like syntax as well:spark.sql("select name, eyeColor from swimmers where eyeColor like 'b%'").show()The output is as follows:TipFor more information, please refer to the Spark SQL, DataFrames, andDatasets Guide at http://bit.ly/2cd1wyx.NoteAn important note when working with Spark SQL and DataFrames ishttps://www.iteblog.comhttp://bit.ly/2cd1wyxthat while it is easy to work with CSV, JSON, and a variety of dataformats, the most common storage format for Spark SQL analyticsqueries is the Parquet file format. It is a columnar format that issupported by many other data processing systems and Spark SQLsupports both reading and writing Parquet files that automaticallypreserves the schema of the original data. For more information, pleaserefer to the latest Spark SQL Programming Guide > Parquet Files at:http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files. Also, there are many performanceoptimizations that pertain to Parquet, including (but not limited to)Automatic Partition Discovery and Schema Migration for Parquet athttps://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html and How Apache Spark performs a fast count usingthe parquet metadata athttps://github.com/dennyglee/databricks/blob/master/misc/parquet-count-metadata-explanation.md.https://www.iteblog.comhttp://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-fileshttps://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.htmlhttps://github.com/dennyglee/databricks/blob/master/misc/parquet-count-metadata-explanation.mdDataFrame scenario – on-timeflight performanceTo showcase the types of queries you can do with DataFrames, let's lookat the use case of on-time flight performance. We will analyze theAirline On-Time Performance and Causes of Flight Delays: On-TimeData (http://bit.ly/2ccJPPM), and join this with the airports dataset,obtained from the Open Flights Airport, airline, and route data(http://bit.ly/2ccK5hw), to better understand the variables associatedwith flight delays.TipFor this section, we will be using Databricks Community Edition (a freeoffering of the Databricks product), which you can get athttps://databricks.com/try-databricks. 
We will be using visualizations and pre-loaded datasets within Databricks to make it easier for you to focus on writing the code and analyzing the results.

If you would prefer to run this on your own environment, you can find the datasets available in our GitHub repository for this book at https://github.com/drabastomek/learningPySpark.

Preparing the source datasets

We will first process the source airports and flight performance datasets by specifying their file path location and importing them using SparkSession:

# Set File Paths
flightPerfFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

# Obtain Airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain Departure Delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the Departure Delays dataset
flightPerf.cache()

Note that we're importing the data using the CSV reader (com.databricks.spark.csv), which works for any specified delimiter (note that the airports data is tab-delimited, while the flight performance data is comma-delimited). Finally, we cache the flight dataset so subsequent queries will be faster.

Joining flight performance and airports

One of the more common tasks with DataFrames/SQL is to join two different datasets; it is often one of the more demanding operations (from a performance perspective). With DataFrames, a lot of the performance optimizations for these joins are included by default:

# Query Sum of Flight Delays by City and Origin Code
# (for Washington State)
spark.sql("""
select a.City, f.origin, sum(f.delay) as Delays
  from FlightPerformance f
  join airports a
    on a.IATA = f.origin
 where a.State = 'WA'
 group by a.City, f.origin
 order by sum(f.delay) desc""").show()

In our scenario, we are querying the total delays by city and origin code for the state of Washington. This will require joining the flight performance data with the airports data by International Air Transport Association (IATA) code. The output of the query is as follows:

Using notebooks (such as Databricks, iPython, Jupyter, and Apache Zeppelin), you can more easily execute and visualize your queries. In the following examples, we will be using the Databricks notebook. Within our Python notebook, we can use the %sql function to execute SQL statements within that notebook cell:

%sql
-- Query Sum of Flight Delays by City and Origin Code (for Washington State)
select a.City, f.origin, sum(f.delay) as Delays
  from FlightPerformance f
  join airports a
    on a.IATA = f.origin
 where a.State = 'WA'
 group by a.City, f.origin
 order by sum(f.delay) desc

This is the same as the previous query, but due to formatting, easier to read.
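The same aggregation can also be expressed with the DataFrame API instead of SQL; below is a hedged sketch reusing the flightPerf and airports DataFrames created above (the F alias and the explicit cast are our additions: delay was read as a string because inferSchema was not set for the flight performance dataset):

from pyspark.sql import functions as F

(flightPerf
    .join(airports, airports.IATA == flightPerf.origin)
    .filter(airports.State == 'WA')
    .groupBy(airports.City, flightPerf.origin)
    .agg(F.sum(flightPerf.delay.cast('int')).alias('Delays'))
    .orderBy('Delays', ascending=False)
    .show())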
In our Databricks notebook example, we can quickly visualize thisdata into a bar chart:https://www.iteblog.comVisualizing our flight-performance dataLet's continue visualizing our data, but broken down by all states in thecontinental US:%sql-- Query Sum of Flight Delays by State (for the US)select a.State, sum(f.delay) as Delays from FlightPerformance f join airports a on a.IATA = f.origin where a.Country = 'USA' group by a.StateThe output bar chart is as follows:https://www.iteblog.comBut, it would be cooler to view this data as a map; click on the bar charticon at the bottom-left of the chart, and you can choose from manydifferent native navigations, including a map:One of the key benefits of DataFrames is that the information isstructured similar to a table. Therefore, whether you are using notebooksor your favorite BI tool, you will be able to quickly visualize your data.TipYou can find the full list of pyspark.sql.DataFrame methods athttp://bit.ly/2bkUGnT.You can find the full list of pyspark.sql.functions athttp://bit.ly/2bTAzLT.https://www.iteblog.comhttp://bit.ly/2bkUGnThttp://bit.ly/2bTAzLTSpark Dataset APIAfter this discussion about Spark DataFrames, let's have a quick recapof the Spark Dataset API. Introduced in Apache Spark 1.6, the goal ofSpark Datasets was to provide an API that allows users to easily expresstransformations on domain objects, while also providing theperformance and benefits of the robust Spark SQL execution engine. Aspart of the Spark 2.0 release (and as noted in the diagram below), theDataFrame APIs is merged into the Dataset API thus unifying dataprocessing capabilities across all libraries. Because of this unification,developers now have fewer concepts to learn or remember, and workwith a single high-level and type-safe API – called Dataset:Conceptually, the Spark DataFrame is an alias for a collection of genericobjects Dataset[Row], where a Row is a generic untyped JVM object.Dataset, by contrast, is a collection of strongly-typed JVM objects,dictated by a case class you define, in Scala or Java. This last point isparticularly important as this means that the Dataset API is notsupported by PySpark due to the lack of benefit from the typeenhancements. Note, for the parts of the Dataset API that are nothttps://www.iteblog.comavailable in PySpark, they can be accessed by converting to an RDD orby using UDFs. For more information, please refer to the jira [SPARK-13233]: Python Dataset at http://bit.ly/2dbfoFT.https://www.iteblog.comhttp://bit.ly/2dbfoFTSummaryWith Spark DataFrames, Python developers can make use of a simplerabstraction layer that is also potentially significantly faster. One of themain reasons Python is initially slower within Spark is due to thecommunication layer between Python sub-processes and the JVM. ForPython DataFrame users, we have a Python wrapper around ScalaDataFrames that avoids the Python sub-process/JVM communicationoverhead. Spark DataFrames has many performance enhancementsthrough the Catalyst Optimizer and Project Tungsten which we havereviewed in this chapter. In this chapter, we also reviewed how to workwith Spark DataFrames and worked on an on-time flight performancescenario using DataFrames.In this chapter, we created and worked with DataFrames by generatingthe data or making use of existing datasets.In the next chapter, we will discuss how to transform and understandyour own data.https://www.iteblog.comChapter 4. 
Prepare Data forModelingAll data is dirty, irrespective of what the source of the data might leadyou to believe: it might be your colleague, a telemetry system thatmonitors your environment, a dataset you download from the web, orsome other source. Until you have tested and proven to yourself thatyour data is in a clean state (we will get to what clean state means in asecond), you should neither trust it nor use it for modeling.Your data can be stained with duplicates, missing observations andoutliers, non-existent addresses, wrong phone numbers and area codes,inaccurate geographical coordinates, wrong dates, incorrect labels,mixtures of upper and lower cases, trailing spaces, and many other moresubtle problems. It is your job to clean it, irrespective of whether you area data scientist or data engineer, so you can build a statistical or machinelearning model.Your dataset is considered technically clean if none of theaforementioned problems can be found. However, to clean the datasetfor modeling purposes, you also need to check the distributions of yourfeatures and confirm they fit the predefined criteria.As a data scientist, you can expect to spend 80-90% of your timemassaging your data and getting familiar with all the features. Thischapter will guide you through that process, leveraging Sparkcapabilities.In this chapter, you will learn how to do the following:Recognize and handle duplicates, missing observations, and outliersCalculate descriptive statistics and correlationsVisualize your data with matplotlib and BokehChecking for duplicates, missinghttps://www.iteblog.comobservations, and outliersUntil you have fully tested the data and proven it worthy of yourtime,you should neither trust it nor use it. In this section, we will show youhow to deal with duplicates, missing observations, and outliers.DuplicatesDuplicates are observations that appear as distinct rows in your dataset,but which, upon closer inspection, look the same. That is, if you lookedat them side by side, all the features in these two (or more) rows wouldhave exactly the same values.On the other hand, if your data has some form of an ID to distinguishbetween records (or associate them with certain users, for example),then what might initially appear as a duplicate may not be; sometimessystems fail and produce erroneous IDs. In such a situation, you need toeither check whether the same ID is a real duplicate, or you need tocome up with a new ID system.Consider the following example:df = spark.createDataFrame([ (1, 144.5, 5.9, 33, 'M'), (2, 167.2, 5.4, 45, 'M'), (3, 124.1, 5.2, 23, 'F'), (4, 144.5, 5.9, 33, 'M'), (5, 133.2, 5.7, 54, 'F'), (3, 124.1, 5.2, 23, 'F'), (5, 129.2, 5.3, 42, 'M'), ], ['id', 'weight', 'height', 'age', 'gender'])As you can see, we have several issues here:We have two rows with IDs equal to 3 and they are exactly the sameRows with IDs 1 and 4 are the same — the only thing that's differentis their IDs, so we can safely assume that they are the same personWe have two rows with IDs equal to 5, but that seems to be arecording issue, as they do not seem to be the same personhttps://www.iteblog.comThis is a very easy dataset with only seven rows. What do you do whenyou have millions of observations? 
The first thing I normally do is to check if I have any duplicates: I compare the counts of the full dataset with the one that I get after running a .distinct() method:

print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows: {0}'.format(df.distinct().count()))

Here's what you get back for our DataFrame:

If these two numbers differ, then you know you have, what I like to call, pure duplicates: rows that are exact copies of each other. We can drop these rows by using the .dropDuplicates(...) method:

df = df.dropDuplicates()

Our dataset will then look as follows (once you run df.show()):

We dropped one of the rows with ID 3. Now let's check whether there are any duplicates in the data irrespective of ID. We can quickly repeat what we have done earlier, but using only columns other than the ID column:

print('Count of ids: {0}'.format(df.count()))
print('Count of distinct ids: {0}'.format(
    df.select([
        c for c in df.columns if c != 'id'
    ]).distinct().count())
)

We should see one more row that is a duplicate:

We can still use the .dropDuplicates(...), but will add the subset parameter that specifies only the columns other than the id column:

df = df.dropDuplicates(subset=[
    c for c in df.columns if c != 'id'
])

The subset parameter instructs the .dropDuplicates(...) method to look for duplicated rows using only the columns specified via the subset parameter; in the preceding example, we will drop the duplicated records with the same weight, height, age, and gender but not id. Running the df.show(), we get the following cleaner dataset, as we dropped the row with id = 1 since it was identical to the record with id = 4:

Now that we know there are no full rows duplicated, or any identical rows differing only by ID, let's check if there are any duplicated IDs. To calculate the total and distinct number of IDs in one step, we can use the .agg(...) method:

import pyspark.sql.functions as fn

df.agg(
    fn.count('id').alias('count'),
    fn.countDistinct('id').alias('distinct')
).show()

Here's the output of the preceding code:

In the previous example, we first import the pyspark.sql.functions module (aliased as fn).

Tip

This gives us access to a vast array of various functions, too many to list here. However, we strongly encourage you to study the PySpark documentation at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions.

Next, we use the .count(...) and .countDistinct(...) to, respectively, calculate the number of rows and the number of distinct ids in our DataFrame. The .alias(...) method allows us to specify a friendly name for the returned column.

As you can see, we have five rows in total, but only four distinct IDs. Since we have already dropped all the duplicates, we can safely assume that this might just be a fluke in our ID data, so we will give each row a unique ID:

df.withColumn('new_id', fn.monotonically_increasing_id()).show()

The preceding code snippet produced the following output:

The .monotonically_increasing_id() method gives each record a unique and increasing ID.
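To convince yourself that the generated IDs really are unique, you can compare the total number of rows with the number of distinct values in the new column; a minimal sketch, reusing the df DataFrame and the fn alias imported above:

with_id = df.withColumn('new_id', fn.monotonically_increasing_id())

# both numbers should be equal: every record received its own ID
with_id.agg(
    fn.count('new_id').alias('rows'),
    fn.countDistinct('new_id').alias('distinct_ids')
).show()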
According to thedocumentation, as long as your data is put into less than roughly 1 billionpartitions with less than 8 billions records in each, the ID is guaranteedto be unique.NoteA word of caution: in earlier versions of Spark the.monotonicallymonotonically_increasing_id() method would notnecessarily return the same IDs across multiple evaluations of the sameDataFrame. This, however, has been fixed in Spark 2.0.Missing observationsYou will frequently encounter datasets with blanks in them. The missingvalues can happen for a variety of reasons: systems failure, people error,data schema changes, just to name a few.The simplest way to deal with missing values, if your data can afford it,is to drop the whole observation when any missing value is found. Youhave to be careful not to drop too many: depending on the distributionof the missing values across your dataset it might severely affect theusability of your dataset. If, after dropping the rows, I end up with avery small dataset, or find that the reduction in data size is more than50%, I start checking my data to see what features have the most holesin them and perhaps exclude those altogether; if a feature has most of itsvalues missing (unless a missing value bears a meaning), from amodeling point of view, it is fairly useless.The other way to deal with the observations with missing values is toimpute some value in place of those Nones. Given the type of your data,you have several options to choose from:If your data is a discrete Boolean, you can turn it into a categoricalvariable by adding a third category — MissingIf your data is already categorical, you can simply extend thenumber of levels and add the Missing category as wellhttps://www.iteblog.comIf you're dealing with ordinal or numerical data, you can imputeeither mean, median, or some other predefined value (for example,first or third quartile, depending on the distribution shape of yourdata)Consider a similar example to the one we presented previously:df_miss = spark.createDataFrame([ (1, 143.5, 5.6, 28, 'M', 100000), (2, 167.2, 5.4, 45, 'M', None), (3, None , 5.2, None, None, None), (4, 144.5, 5.9, 33, 'M', None), (5, 133.2, 5.7, 54, 'F', None), (6, 124.1, 5.2, None, 'F', None), (7, 129.2, 5.3, 42, 'M', 76000), ], ['id', 'weight', 'height', 'age', 'gender', 'income'])In our example, we deal with a number of missing values categories.Analyzing rows, we see the following:The row with ID 3 has only one useful piece of information—theheightThe row with ID 6 has only one missing value—the ageAnalyzing columns, we can see the following:The income column, since it is a very personal thing to disclose, hasmost of its values missingThe weight and gender columns have only one missing value eachThe age column has two missing valuesTo find the number of missing observations per row, we can use thefollowing snippet:df_miss.rdd.map( lambda row: (row['id'], sum([c== None for c in row]))).collect()It produces the following output:https://www.iteblog.comIt tells us that, for example, the row with ID 3 has four missingobservations, as we observed earlier.Let's see what values are missing so that when we count missingobservations in columns, we can decide whether to drop the observationaltogether or impute some of the observations:df_miss.where('id == 3').show()Here's what we get:Let's now check what percentage of missing observations are there ineach column:df_miss.agg(*[ (1 - (fn.count(c) / fn.count('*'))).alias(c + '_missing') for c in df_miss.columns]).show()This generates the following 
output:https://www.iteblog.comNoteThe * argument to the .count(...) method (in place of a column name)instructs the method to count all rows. On the other hand, the *preceding the list declaration instructs the .agg(...) method to treat thelist as a set of separate parameters passed to the function.So, we have 14% of missing observations in the weight and gendercolumns, twice as much in the height column, and almost 72% ofmissing observations in the income column. Now we know what to do.First, we will drop the 'income' feature, as most of its values aremissing.df_miss_no_income = df_miss.select([ c for c in df_miss.columns if c != 'income'])We now see that we do not need to drop the row with ID 3 as thecoverage in the 'weight' and 'age' columns has enough observations(in our simplified example) to calculate the mean and impute it in theplace of the missing values.However, if you decide to drop the observations instead, you can use the.dropna(...) method, as shown here. Here, we will also use the threshparameter, which allows us to specify a threshold on the number ofmissing observations per row that would qualify the row to be dropped.This is useful if you have a dataset with tens or hundreds of features andyou only want to drop those rows that exceed a certain threshold ofmissing values:df_miss_no_income.dropna(thresh=3).show()The preceding code produces the following output:https://www.iteblog.comOn the other hand, if you wanted to impute the observations, you canuse the .fillna(...) method. This method accepts a single integer (longis also accepted), float, or string; all missing values in the whole datasetwill then be filled in with that value. You can also pass a dictionary of aform {'': }. This has the same limitation,in that, as the , you can only pass an integer, float, orstring.If you want to impute a mean, median, or other calculated value, youneed to first calculate the value, create a dictionary with such values,and then pass it to the .fillna(...) method.Here's how we do it:means = df_miss_no_income.agg( *[fn.mean(c).alias(c) for c in df_miss_no_income.columns if c != 'gender']).toPandas().to_dict('records')[0]means['gender'] = 'missing'df_miss_no_income.fillna(means).show()https://www.iteblog.comThe preceding code will produce the following output:We omit the gender column as one cannot calculate a mean of acategorical variable, obviously.We use a double conversion here. Taking the output of the .agg(...)method (a PySpark DataFrame), we first convert it into a pandas'DataFrame and then once more to a dictionary.TipNote that calling the .toPandas() can be problematic, as the methodworks essentially in the same way as .collect() in RDDs. It collects allthe information from the workers and brings it over to the driver. It isunlikely to be a problem with the preceding dataset, unless you havethousands upon thousands of features.The records parameter to the .to_dict(...) method of pandas instructsit to create the following dictionary:https://www.iteblog.comSince we cannot calculate the average (or any other numeric metric of acategorical variable), we added the missing category to the dictionaryfor the gender feature. Note that, even though the mean of the agecolumn is 40.40, when imputed, the type of the df_miss_no_income.agecolumn was preserved—it is still an integer.OutliersOutliers are those observations that deviate significantly from thedistribution of the rest of your sample. 
The definitions of significancevary, but in the most general form, you can accept that there are nooutliers if all the values are roughly within the Q1−1.5IQR andQ3+1.5IQR range, where IQR is the interquartile range; the IQR isdefined as a difference between the upper- and lower-quartiles, that is,the 75th percentile (the Q3) and 25th percentile (the Q1), respectively.Let's, again, consider a simple example:df_outliers = spark.createDataFrame([ (1, 143.5, 5.3, 28), (2, 154.2, 5.5, 45), (3, 342.3, 5.1, 99), (4, 144.5, 5.5, 33), (5, 133.2, 5.4, 54), (6, 124.1, 5.1, 21), (7, 129.2, 5.3, 42), ], ['id', 'weight', 'height', 'age'])Now we can use the definition we outlined previously to flag thehttps://www.iteblog.comoutliers.First, we calculate the lower and upper cut off points for each feature.We will use the .approxQuantile(...) method. The first parameterspecified is the name of the column, the second parameter can be eithera number between 0 or 1 (where 0.5 means to calculated median) or alist (as in our case), and the third parameter specifies the acceptablelevel of an error for each metric (if set to 0, it will calculate an exactvalue for the metric, but it can be really expensive to do so):cols = ['weight', 'height', 'age']bounds = {}for col in cols: quantiles = df_outliers.approxQuantile( col, [0.25, 0.75], 0.05 ) IQR = quantiles[1] - quantiles[0] bounds[col] = [ quantiles[0] - 1.5 * IQR, quantiles[1] + 1.5 * IQR]The bounds dictionary holds the lower and upper bounds for eachfeature:Let's now use it to flag our outliers:outliers = df_outliers.select(*['id'] + [ ( (df_outliers[c] bounds[c][1]) ).alias(c + '_o') for c in colshttps://www.iteblog.com])outliers.show()The preceding code produces the following output:We have two outliers in the weight feature and two in the age feature.By now you should know how to extract these, but here is a snippet thatlists the values significantly differing from the rest of the distribution:df_outliers = df_outliers.join(outliers, on='id')df_outliers.filter('weight_o').select('id', 'weight').show()df_outliers.filter('age_o').select('id', 'age').show()The preceding code will give you the following output:https://www.iteblog.comEquipped with the methods described in this section, you can quicklyclean up even the biggest of datasets.https://www.iteblog.comGetting familiar with your dataAlthough we would strongly discourage such behavior, you can build amodel without knowing your data; it will most likely take you longer,and the quality of the resulting model might be less than optimal, but it isdoable.NoteIn this section, we will use the dataset we downloaded fromhttp://packages.revolutionanalytics.com/datasets/ccFraud.csv. We didnot alter the dataset itself, but it was GZipped and uploaded tohttp://tomdrabas.com/data/LearningPySpark/ccFraud.csv.gz. Pleasedownload the file first and save it in the same folder that contains yournotebook for this chapter.The head of the dataset looks as follows:Thus, any serious data scientist or data modeler will become acquaintedwith the dataset before starting any modeling. 
As a first thing, wenormally start with some descriptive statistics to get a feeling for whatwe are dealing with.Descriptive statisticsDescriptive statistics, in the simplest sense, will tell you the basicinformation about your dataset: how many non-missing observationsthere are in your dataset, the mean and the standard deviation for thecolumn, as well as the min and max values.https://www.iteblog.comhttp://packages.revolutionanalytics.com/datasets/ccFraud.csvhttp://tomdrabas.com/data/LearningPySpark/ccFraud.csv.gzHowever, first things first—let's load our data and convert it to a SparkDataFrame:import pyspark.sql.types as typFirst, we load the only module we will need. The pyspark.sql.typesexposes all the data types we can use, such as IntegerType() orFloatType().NoteFor a full list of available types checkhttp://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types.Next, we read the data in and remove the header line using the.filter(...) method. This is followed by splitting the row on eachcomma (since this is a .csv file) and converting each element to aninteger:fraud = sc.textFile('ccFraud.csv.gz')header = fraud.first()fraud = fraud \ .filter(lambda row: row != header) \ .map(lambda row: [int(elem) for elem in row.split(',')])Next, we create the schema for our DataFrame:fields = [ *[ typ.StructField(h[1:-1], typ.IntegerType(), True) for h in header.split(',') ]]schema = typ.StructType(fields)Finally, we create our DataFrame:fraud_df = spark.createDataFrame(fraud, schema)Having created our fraud_df DataFrame, we can calculate the basichttps://www.iteblog.comhttp://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.typesdescriptive statistics for our dataset. However, you need to rememberthat even though all of our features appear as numeric in nature, some ofthem are categorical (for example, gender or state).Here's the schema of our DataFrame:fraud_df.printSchema()The representation is shown here:Also, no information would be gained from calculating the mean andstandard deviation of the custId column, so we will not be doing that.For a better understanding of categorical columns, we will count thefrequencies of their values using the .groupby(...) method. In thisexample, we will count the frequencies of the gender column:fraud_df.groupby('gender').count().show()The preceding code will produce the following output:https://www.iteblog.comAs you can see, we are dealing with a fairly imbalanced dataset. Whatyou would expect to see is an equal distribution for both genders.NoteIt goes beyond the scope of this chapter, but if we were building astatistical model, you would need to take care of these kinds of biases.You can read more athttp://www.va.gov/VETDATA/docs/SurveysAndStudies/SAMPLE_WEIGHT.pdfFor the truly numerical features, we can use the .describe() method:numerical = ['balance', 'numTrans', 'numIntlTrans']desc = fraud_df.describe(numerical)desc.show()The .show() method will produce the following output:https://www.iteblog.comhttp://www.va.gov/VETDATA/docs/SurveysAndStudies/SAMPLE_WEIGHT.pdfEven from these relatively few numbers we can tell quite a bit:All of the features are positively skewed. 
The maximum values are a number of times larger than the average.
The coefficient of variation (the ratio of the standard deviation to the mean) is very high (close to or greater than 1), suggesting a wide spread of observations.

Here's how you check the skewness (we will do it for the 'balance' feature only):

fraud_df.agg({'balance': 'skewness'}).show()

The preceding code produces the following output:

A list of aggregation functions (the names are fairly self-explanatory) includes: avg(), count(), countDistinct(), first(), kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance().

Correlations

Another highly useful measure of mutual relationships between features is correlation. Your model would normally include only those features that are highly correlated with your target. However, it is almost equally important to check the correlation between the features; including features that are highly correlated among them (that is, are collinear) may lead to unpredictable behavior of your model, or might unnecessarily complicate it.

Note

I talk more about multicollinearity in my other book, Practical Data Analysis Cookbook, Packt Publishing (https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook), in Chapter 5, Introducing MLlib, under the section titled Identifying and tackling multicollinearity.

Calculating correlations in PySpark is very easy once your data is in a DataFrame form. The only difficulties are that the .corr(...) method supports the Pearson correlation coefficient at the moment, and it can only calculate pairwise correlations, such as the following:

fraud_df.corr('balance', 'numTrans')

In order to create a correlations matrix, you can use the following script:

n_numerical = len(numerical)

corr = []

for i in range(0, n_numerical):
    temp = [None] * i

    for j in range(i, n_numerical):
        temp.append(fraud_df.corr(numerical[i], numerical[j]))
    corr.append(temp)

The preceding code will create the following output:

As you can see, the correlations between the numerical features in the credit card fraud dataset are pretty much non-existent. Thus, all these features can be used in our models, should they turn out to be statistically sound in explaining our target.

Having checked the correlations, we can now move on to visually inspecting our data.

Visualization

There are multiple visualization packages, but in this section we will be using matplotlib and Bokeh exclusively to give you the best tools for your needs.

Both of the packages come preinstalled with Anaconda. First, let's load the modules and set them up:

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import bokeh.charts as chrt
from bokeh.io import output_notebook

output_notebook()

The %matplotlib inline and the output_notebook() commands will make every chart generated with matplotlib or Bokeh, respectively, appear within the notebook and not as a separate window.

Histograms

Histograms are by far the easiest way to visually gauge the distribution of your features.
There are three ways you can generate histograms in PySpark (or a Jupyter notebook):

Aggregate the data in workers and return an aggregated list of bins and counts in each bin of the histogram to the driver
Return all the data points to the driver and allow the plotting libraries' methods to do the job for you
Sample your data and then return them to the driver for plotting

If the number of rows in your dataset is counted in billions, then the second option might not be attainable. Thus, you need to aggregate the data first:

hists = fraud_df.select('balance').rdd.flatMap(
    lambda row: row
).histogram(20)

To plot the histogram, you can simply call matplotlib, as shown in the following code:

data = {
    'bins': hists[0][:-1],
    'freq': hists[1]
}

plt.bar(data['bins'], data['freq'], width=2000)
plt.title('Histogram of \'balance\'')

This will produce the following chart:

In a similar manner, a histogram can be created with Bokeh:

b_hist = chrt.Bar(
    data,
    values='freq', label='bins',
    title='Histogram of \'balance\'')
chrt.show(b_hist)

Since Bokeh uses D3.js in the background, the resulting chart is interactive:

If your data is small enough to fit on the driver (although we would argue it would normally be faster to use the previous method), you can bring the data and use the .hist(...) (from matplotlib) or .Histogram(...) (from Bokeh) methods:

data_driver = {
    'obs': fraud_df.select('balance').rdd.flatMap(
        lambda row: row
    ).collect()
}

plt.hist(data_driver['obs'], bins=20)
plt.title('Histogram of \'balance\' using .hist()')

b_hist_driver = chrt.Histogram(
    data_driver, values='obs',
    title='Histogram of \'balance\' using .Histogram()',
    bins=20)
chrt.show(b_hist_driver)

This will produce the following chart for matplotlib:

For Bokeh, the following chart will be generated:

Interactions between features

Scatter charts allow us to visualize interactions between up to three variables at a time (although we will be only presenting a 2D interaction in this section).

Tip

You should rarely revert to 3D visualizations unless you are dealing with some temporal data and you want to observe changes over time. Even then, we would rather discretize the time data and present a series of 2D charts, as interpreting 3D charts is somewhat more complicated and (most of the time) confusing.

Since PySpark does not offer any visualization modules on the server side, and trying to plot billions of observations at the same time would be highly impractical, in this section we will sample the dataset at 0.02% (roughly 2,000 observations).

Tip

Unless you chose stratified sampling, you should create at least three to five samples at a predefined sampling fraction so you can check if your sample is somewhat representative of your dataset—that is, that the differences between your samples are not big.

In this example, we will sample our fraud dataset at 0.02% given 'gender' as the strata:

data_sample = fraud_df.sampleBy(
    'gender', {1: 0.0002, 2: 0.0002}
).select(numerical)

To put multiple 2D charts in one go, you can use the following code:

data_multi = dict([
    (elem, data_sample.select(elem).rdd \
        .flatMap(lambda row: row).collect())
    for elem in numerical
])

sctr = chrt.Scatter(data_multi, x='balance', y='numTrans')
chrt.show(sctr)

The preceding code will produce the following chart:

As you can see, there are plenty of fraudulent transactions that had 0 balance but many transactions—that is, a fresh card and a big spike of transactions.
However, no specific pattern can be shown apart fromsome banding occurring at $1,000 intervals.https://www.iteblog.comSummaryIn this chapter, we looked at how to clean and prepare your dataset formodeling by identifying and tackling datasets with missing values,duplicates, and outliers. We also looked at how to get a bit more familiarwith your data using tools from PySpark (although this is by no means afull manual on how to analyze your datasets). Finally, we showed youhow to chart your data.We will use these (and more) techniques in the next two chapters, wherewe will be building machine learning models.https://www.iteblog.comChapter 5. Introducing MLlibIn the previous chapter, we learned how to prepare the data formodeling. In this chapter, we will actually use some of that learning tobuild a classification model using the MLlib package of PySpark.MLlib stands for Machine Learning Library. Even though MLlib is nowin a maintenance mode, that is, it is not actively being developed (andwill most likely be deprecated later), it is warranted that we cover atleast some of the features of the library. In addition, MLlib is currentlythe only library that supports training models for streaming.NoteStarting with Spark 2.0, ML is the main machine learning library thatoperates on DataFrames instead of RDDs as is the case for MLlib.The documentation for MLlib can be found here:http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html.In this chapter, you will learn how to do the following:Prepare the data for modeling with MLlibPerform statistical testingPredict survival chances of infants using logistic regressionSelect the most predictable features and train a random forest modelOverview of the packageAt the high level, MLlib exposes three core machine learningfunctionalities:Data preparation: Feature extraction, transformation, selection,hashing of categorical features, and some natural languageprocessing methodsMachine learning algorithms: Some popular and advancedregression, classification, and clustering algorithms are implementedhttps://www.iteblog.comhttp://spark.apache.org/docs/latest/api/python/pyspark.mllib.htmlUtilities: Statistical methods such as descriptive statistics, chi-square testing, linear algebra (sparse and dense matrices andvectors), and model evaluation methodsAs you can see, the palette of available functionalities allows you toperform almost all of the fundamental data science tasks.In this chapter, we will build two classification models: a linearregression and a random forest. We will use a portion of the US 2014and 2015 birth data we downloaded fromhttp://www.cdc.gov/nchs/data_access/vitalstatsonline.htm; from thetotal of 300 variables we selected 85 features that we will use to buildour models. 
Also, out of the total of almost 7.99 million records, weselected a balanced sample of 45,429 records: 22,080 records whereinfants were reported dead and 23,349 records with infants alive.TipThe dataset we will use in this chapter can be downloaded fromhttp://www.tomdrabas.com/data/LearningPySpark/births_train.csv.gz.https://www.iteblog.comhttp://www.cdc.gov/nchs/data_access/vitalstatsonline.htmhttp://www.tomdrabas.com/data/LearningPySpark/births_train.csv.gzLoading and transforming thedataEven though MLlib is designed with RDDs and DStreams in focus, forease of transforming the data we will read the data and convert it to aDataFrame.NoteThe DStreams are the basic data abstraction for Spark Streaming (seehttp://bit.ly/2jIDT2A)Just like in the previous chapter, we first specify the schema of ourdataset.NoteNote that here (for brevity), we only present a handful of features. Youshould always check our GitHub account for this book for the latestversion of the code: https://github.com/drabastomek/learningPySpark.Here's the code:import pyspark.sql.types as typlabels = [ ('INFANT_ALIVE_AT_REPORT', typ.StringType()), ('BIRTH_YEAR', typ.IntegerType()), ('BIRTH_MONTH', typ.IntegerType()), ('BIRTH_PLACE', typ.StringType()), ('MOTHER_AGE_YEARS', typ.IntegerType()), ('MOTHER_RACE_6CODE', typ.StringType()), ('MOTHER_EDUCATION', typ.StringType()), ('FATHER_COMBINED_AGE', typ.IntegerType()), ('FATHER_EDUCATION', typ.StringType()), ('MONTH_PRECARE_RECODE', typ.StringType()), ... ('INFANT_BREASTFED', typ.StringType())]schema = typ.StructType([ typ.StructField(e[0], e[1], False) for e in labelshttps://www.iteblog.comhttp://bit.ly/2jIDT2Ahttps://github.com/drabastomek/learningPySpark ])Next, we load the data. The .read.csv(...) method can read eitheruncompressed or (as in our case) GZipped comma-separated values. Theheader parameter set to True indicates that the first row contains theheader, and we use the schema to specify the correct data types:births = spark.read.csv('births_train.csv.gz', header=True, schema=schema)There are plenty of features in our dataset that are strings. These aremostly categorical variables that we need to somehow convert to anumeric form.TipYou can glimpse over the original file schema specification here:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2015.pdfWe will first specify our recode dictionary:recode_dictionary = { 'YNU': { 'Y': 1, 'N': 0, 'U': 0 }}Our goal in this chapter is to predict whether the'INFANT_ALIVE_AT_REPORT' is either 1 or 0. 
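Before dropping anything, it is worth confirming how balanced the label really is in the DataFrame we just loaded. Here is a quick check (a sketch, assuming births was created with the .read.csv(...) call above):

# count how many records fall into each label value
births.groupBy('INFANT_ALIVE_AT_REPORT').count().show()

The two counts should roughly match the 22,080/23,349 split described earlier.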
Thus, we will drop all of thefeatures that relate to the infant and will try to predict the infant'schances of surviving only based on the features related to its mother,father, and the place of birth:selected_features = [ 'INFANT_ALIVE_AT_REPORT', 'BIRTH_PLACE', 'MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE', 'CIG_BEFORE', https://www.iteblog.comftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2015.pdf 'CIG_1_TRI','CIG_2_TRI', 'CIG_3_TRI', 'MOTHER_HEIGHT_IN', 'MOTHER_PRE_WEIGHT', 'MOTHER_DELIVERY_WEIGHT', 'MOTHER_WEIGHT_GAIN', 'DIABETES_PRE', 'DIABETES_GEST', 'HYP_TENS_PRE', 'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM']births_trimmed = births.select(selected_features)In our dataset, there are plenty of features with Yes/No/Unknownvalues; we will only code Yes to 1; everything else will be set to 0.There is also a small problem with how the number of cigarettes smokedby the mother was coded: as 0 means the mother smoked no cigarettesbefore or during the pregnancy, between 1-97 states the actual numberof cigarette smoked, 98 indicates either 98 or more, whereas 99identifies the unknown; we will assume the unknown is 0 and recodeaccordingly.So next we will specify our recoding methods:import pyspark.sql.functions as funcdef recode(col, key): return recode_dictionary[key][col] def correct_cig(feat): return func \ .when(func.col(feat) != 99, func.col(feat))\ .otherwise(0)rec_integer = func.udf(recode, typ.IntegerType())The recode method looks up the correct key from therecode_dictionary (given the key) and returns the corrected value. Thecorrect_cig method checks when the value of the feature feat is notequal to 99 and (for that situation) returns the value of the feature; if thevalue is equal to 99, we get 0 otherwise.https://www.iteblog.comWe cannot use the recode function directly on a DataFrame; it needs tobe converted to a UDF that Spark will understand. The rec_integer issuch a function: by passing our specified recode function and specifyingthe return value data type, we can use it then to encode ourYes/No/Unknown features.So, let's get to it. First, we'll correct the features related to the number ofcigarettes smoked:births_transformed = births_trimmed \ .withColumn('CIG_BEFORE', correct_cig('CIG_BEFORE'))\ .withColumn('CIG_1_TRI', correct_cig('CIG_1_TRI'))\ .withColumn('CIG_2_TRI', correct_cig('CIG_2_TRI'))\ .withColumn('CIG_3_TRI', correct_cig('CIG_3_TRI'))The .withColumn(...) method takes the name of the column as its firstparameter and the transformation as the second one. In the previouscases, we do not create new columns, but reuse the same ones instead.Now we will focus on correcting the Yes/No/Unknown features. First,we will figure out which these are with the following snippet:cols = [(col.name, col.dataType) for col in births_trimmed.schema]YNU_cols = []for i, s in enumerate(cols): if s[1] == typ.StringType(): dis = births.select(s[0]) \ .distinct() \ .rdd \ .map(lambda row: row[0]) \ .collect() if 'Y' in dis: YNU_cols.append(s[0])First, we created a list of tuples (cols) that hold column names andcorresponding data types. 
Next, we loop through all of these andcalculate distinct values of all string columns; if a 'Y' is within thereturned list, we append the column name to the YNU_cols list.DataFrames can transform the features in bulk while selecting features.https://www.iteblog.comTo present the idea, consider the following example:births.select([ 'INFANT_NICU_ADMISSION', rec_integer( 'INFANT_NICU_ADMISSION', func.lit('YNU') ) \ .alias('INFANT_NICU_ADMISSION_RECODE')] ).take(5)Here's what we get in return:We select the 'INFANT_NICU_ADMISSION' column and we pass the nameof the feature to the rec_integer method. We also alias the newlytransformed column as 'INFANT_NICU_ADMISSION_RECODE'. This way wewill also confirm that our UDF works as intended.So, to transform all the YNU_cols in one go, we will create a list of suchtransformations, as shown here:exprs_YNU = [ rec_integer(x, func.lit('YNU')).alias(x) if x in YNU_cols else x for x in births_transformed.columns]births_transformed = births_transformed.select(exprs_YNU)Let's check if we got it correctly:births_transformed.select(YNU_cols[-5:]).show(5)Here's what we get:https://www.iteblog.comLooks like everything worked as we wanted it to work, so let's get toknow our data better.https://www.iteblog.comGetting to know your dataIn order to build a statistical model in an informed way, an intimateknowledge of the dataset is necessary. Without knowing the data it ispossible to build a successful model, but it is then a much more arduoustask, or it would require more technical resources to test all the possiblecombinations of features. Therefore, after spending the required 80% ofthe time cleaning the data, we spend the next 15% getting to know it!Descriptive statisticsI normally start with descriptive statistics. Even though the DataFramesexpose the .describe() method, since we are working with MLlib, wewill use the .colStats(...) method.NoteA word of warning: the .colStats(...) calculates the descriptivestatistics based on a sample. For real world datasets this should notreally matter but if your dataset has less than 100 observations youmight get some strange results.The method takes an RDD of data to calculate the descriptive statistics ofand return a MultivariateStatisticalSummary object that contains thefollowing descriptive statistics:count(): This holds a row countmax(): This holds maximum value in the columnmean(): This holds the value of the mean for the values in thecolumnmin(): This holds the minimum value in the columnnormL1(): This holds the value of the L1-Norm for the values in thecolumnnormL2(): This holds the value of the L2-Norm for the values in thecolumnnumNonzeros(): This holds the number of nonzero values in thecolumnhttps://www.iteblog.comvariance(): This holds the value of the variance for the values inthe columnNoteYou can read more about the L1- and L2-norms here http://bit.ly/2jJJPJ0We recommend checking the documentation of Spark to learn moreabout these. 
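If the normL1() and normL2() entries look unfamiliar, here is a small, self-contained sketch (plain NumPy, independent of the pipeline in this chapter) of what the two norms correspond to for a single column of values:

import numpy as np

values = np.array([1.0, -2.0, 3.0])
norm_l1 = np.abs(values).sum()           # L1: sum of absolute values
norm_l2 = np.sqrt((values ** 2).sum())   # L2: Euclidean length
print(norm_l1, norm_l2)                  # prints 6.0 and roughly 3.74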
The following is a snippet that calculates the descriptivestatistics of the numeric features:import pyspark.mllib.stat as stimport numpy as npnumeric_cols = ['MOTHER_AGE_YEARS','FATHER_COMBINED_AGE', 'CIG_BEFORE','CIG_1_TRI','CIG_2_TRI','CIG_3_TRI', 'MOTHER_HEIGHT_IN','MOTHER_PRE_WEIGHT', 'MOTHER_DELIVERY_WEIGHT','MOTHER_WEIGHT_GAIN' ]numeric_rdd = births_transformed\ .select(numeric_cols)\ .rdd \ .map(lambda row: [e for e in row])mllib_stats = st.Statistics.colStats(numeric_rdd)for col, m, v in zip(numeric_cols, mllib_stats.mean(), mllib_stats.variance()): print('{0}: \t{1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))The preceding code produces the following result:https://www.iteblog.comhttp://bit.ly/2jJJPJ0As you can see, mothers, compared to fathers, are younger: the averageage of mothers was 28 versus over 44 for fathers. A good indication (atleast for some of the infants) was that many mothers quit smoking whilebeing pregnant; it is horrifying, though, that there still were some thatcontinued smoking.For the categorical variables, we will calculate the frequencies of theirvalues:categorical_cols = [e for e in births_transformed.columns if e not in numeric_cols]categorical_rdd = births_transformed\ .select(categorical_cols)\ .rdd \ .map(lambda row: [e for e in row])for i, col in enumerate(categorical_cols): agg = categorical_rdd \ .groupBy(lambda row: row[i]) \ .map(lambda row: (row[0], len(row[1]))) print(col, sorted(agg.collect(), key=lambda el: el[1], reverse=True))Here is what the results look like:https://www.iteblog.comMost of the deliveries happened in hospital (BIRTH_PLACE equal to 1).Around 550 deliveries happenedat home: some intentionally('BIRTH_PLACE' equal to 3), and some not ('BIRTH_PLACE' equal to 4).CorrelationsCorrelations help to identify collinear numeric features and handle themappropriately. Let's check the correlations between our features:corrs = st.Statistics.corr(numeric_rdd)for i, el in enumerate(corrs > 0.5): correlated = [ (numeric_cols[j], corrs[i][j]) for j, e in enumerate(el) if e == 1.0 and j != i] if len(correlated) > 0: for e in correlated: print('{0}-to-{1}: {2:.2f}' \ .format(numeric_cols[i], e[0], e[1]))The preceding code will calculate the correlation matrix and will printonly those features that have a correlation coefficient greater than 0.5:the corrs > 0.5 part takes care of that.Here's what we get:https://www.iteblog.comAs you can see, the 'CIG_...' features are highly correlated, so we candrop most of them. Since we want to predict the survival chances of aninfant as soon as possible, we will keep only the 'CIG_1_TRI'. Also, asexpected, the weight features are also highly correlated and we will onlykeep the 'MOTHER_PRE_WEIGHT':features_to_keep = [ 'INFANT_ALIVE_AT_REPORT', 'BIRTH_PLACE', 'MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE', 'CIG_1_TRI', 'MOTHER_HEIGHT_IN', 'MOTHER_PRE_WEIGHT', 'DIABETES_PRE', 'DIABETES_GEST', 'HYP_TENS_PRE', 'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM'https://www.iteblog.com]births_transformed = births_transformed.select([e for e in features_to_keep])Statistical testingWe cannot calculate correlations for the categorical features. However,we can run a Chi-square test to determine if there are significantdifferences.Here's how you can do it using the .chiSqTest(...) 
method of MLlib:import pyspark.mllib.linalg as lnfor cat in categorical_cols[1:]: agg = births_transformed \ .groupby('INFANT_ALIVE_AT_REPORT') \ .pivot(cat) \ .count() agg_rdd = agg \ .rdd \ .map(lambda row: (row[1:])) \ .flatMap(lambda row: [0 if e == None else e for e in row]) \ .collect() row_length = len(agg.collect()[0]) - 1 agg = ln.Matrices.dense(row_length, 2, agg_rdd) test = st.Statistics.chiSqTest(agg) print(cat, round(test.pValue, 4))We loop through all the categorical variables and pivot them by the'INFANT_ALIVE_AT_REPORT' feature to get the counts. Next, wetransform them into an RDD, so we can then convert them into a matrixusing the pyspark.mllib.linalg module. The first parameter to the.Matrices.dense(...) method specifies the number of rows in thematrix; in our case, it is the length of distinct values of the categoricalfeature.The second parameter specifies the number of columns: we have two asour 'INFANT_ALIVE_AT_REPORT' target variable has only two values.The last parameter is a list of values to be transformed into a matrix.https://www.iteblog.comHere's an example that shows this more clearly:print(ln.Matrices.dense(3,2, [1,2,3,4,5,6]))The preceding code produces the following matrix:Once we have our counts in a matrix form, we can use the.chiSqTest(...) to calculate our test.Here's what we get in return:Our tests reveal that all the features should be significantly different andshould help us predict the chance of survival of an infant.https://www.iteblog.comCreating the final datasetTherefore, it is time to create our final dataset that we will use to buildour models. We will convert our DataFrame into an RDD ofLabeledPoints.A LabeledPoint is a MLlib structure that is used to train the machinelearning models. It consists of two attributes: label and features.The label is our target variable and features can be a NumPy array,list, pyspark.mllib.linalg.SparseVector,pyspark.mllib.linalg.DenseVector, or scipy.sparse column matrix.Creating an RDD of LabeledPointsBefore we build our final dataset, we first need to deal with one finalobstacle: our 'BIRTH_PLACE' feature is still a string. While any of theother categorical variables can be used as is (as they are now dummyvariables), we will use a hashing trick to encode the 'BIRTH_PLACE'feature:import pyspark.mllib.feature as ftimport pyspark.mllib.regression as reghashing = ft.HashingTF(7)births_hashed = births_transformed \ .rdd \ .map(lambda row: [ list(hashing.transform(row[1]).toArray()) if col == 'BIRTH_PLACE' else row[i] for i, col in enumerate(features_to_keep)]) \ .map(lambda row: [[e] if type(e) == int else e for e in row]) \ .map(lambda row: [item for sublist in row for item in sublist]) \ .map(lambda row: reg.LabeledPoint( row[0], ln.Vectors.dense(row[1:])) )https://www.iteblog.comFirst, we create the hashing model. Our feature has seven levels, so weuse as many features as that for the hashing trick. Next, we actually usethe model to convert our 'BIRTH_PLACE' feature into a SparseVector;such a data structure is preferred if your dataset has many columns butin a row only a few of them have non-zero values. We then combine allthe features together and finally create a LabeledPoint.Splitting into training and testingBefore we move to the modeling stage, we need to split our dataset intotwo sets: one we'll use for training and the other for testing. Luckily,RDDs have a handy method to do just that: .randomSplit(...). 
The method takes a list of proportions that are to be used to randomly split the dataset.

Here is how it is done:

births_train, births_test = births_hashed.randomSplit([0.6, 0.4])

That's it! Nothing more needs to be done.

Predicting infant survival

Finally, we can move to predicting the infants' survival chances. In this section, we will build two models: a linear classifier—the logistic regression, and a non-linear one—a random forest. For the former one, we will use all the features at our disposal, whereas for the latter one, we will employ a ChiSqSelector(...) method to select the top four features.

Logistic regression in MLlib

Logistic regression is somewhat of a benchmark to build any classification model. MLlib used to provide a logistic regression model estimated using a stochastic gradient descent (SGD) algorithm. This model has been deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.

The LogisticRegressionWithLBFGS model uses the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm. It is a quasi-Newton method that approximates the BFGS algorithm.

Note

For those of you who are mathematically adept and interested in this, we suggest perusing this blog post that is a nice walk-through of the optimization algorithms: http://aria42.com/blog/2014/12/understanding-lbfgs.

First, we train the model on our data:

from pyspark.mllib.classification \
    import LogisticRegressionWithLBFGS

LR_Model = LogisticRegressionWithLBFGS \
    .train(births_train, iterations=10)

Training the model is very simple: we just need to call the .train(...) method. The required parameter is the RDD of LabeledPoints; we also specified the number of iterations so it does not take too long to run.

Having trained the model using the births_train dataset, let's use the model to predict the classes for our testing set:

LR_results = (
    births_test.map(lambda row: row.label) \
    .zip(LR_Model \
        .predict(births_test \
            .map(lambda row: row.features)))
).map(lambda row: (row[0], row[1] * 1.0))

The preceding snippet creates an RDD where each element is a tuple, with the first element being the actual label and the second one, the model's prediction.

MLlib provides evaluation metrics for classification and regression.
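Because LR_results is simply an RDD of (actual label, prediction) pairs, you could also compute a rough share of correct predictions by hand before turning to the built-in metrics. A sketch, assuming LR_results as constructed above:

# fraction of test records where the predicted class matches the label
matched = LR_results.filter(lambda lp: lp[0] == lp[1]).count()
total = LR_results.count()
print('Correctly classified: {0:.2f}'.format(matched / float(total)))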
Let's check how well or how badly our model performed:

import pyspark.mllib.evaluation as ev

LR_evaluation = ev.BinaryClassificationMetrics(LR_results)

print('Area under PR: {0:.2f}' \
    .format(LR_evaluation.areaUnderPR))
print('Area under ROC: {0:.2f}' \
    .format(LR_evaluation.areaUnderROC))

LR_evaluation.unpersist()

Here's what we got:

The model performed reasonably well! The 85% area under the Precision-Recall curve indicates a good fit. In this case, we might be getting slightly more predicted deaths (true and false positives). This is actually a good thing, as it would allow doctors to put the expectant mother and the infant under special care.

The area under the Receiver-Operating Characteristic (ROC) curve can be understood as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one. A 63% value can be thought of as acceptable.

Note

For more on these metrics, we point interested readers to http://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves and http://gim.unmc.edu/dxtests/roc3.htm.

Selecting only the most predictable features

Any model that uses fewer features to predict a class accurately should always be preferred to a more complex one. MLlib allows us to select the most predictive features using a Chi-Square selector.

Here's how you do it:

selector = ft.ChiSqSelector(4).fit(births_train)

topFeatures_train = (
    births_train.map(lambda row: row.label) \
    .zip(selector \
        .transform(births_train \
            .map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))

topFeatures_test = (
    births_test.map(lambda row: row.label) \
    .zip(selector \
        .transform(births_test \
            .map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))

We asked the selector to return the four most predictive features from the dataset and train the selector using the births_train dataset. We then used the model to extract only those features from our training and testing datasets.

The .ChiSqSelector(...) method can only be used for numerical features; categorical variables need to be either hashed or dummy coded before the selector can be used.

Random forest in MLlib

We are now ready to build the random forest model.

The following code shows you how to do it:

from pyspark.mllib.tree import RandomForest

RF_model = RandomForest \
    .trainClassifier(data=topFeatures_train,
                     numClasses=2,
                     categoricalFeaturesInfo={},
                     numTrees=6,
                     featureSubsetStrategy='all',
                     seed=666)

The first parameter to the .trainClassifier(...) method specifies the training dataset. The numClasses one indicates how many classes our target variable has. As the third parameter, you can pass a dictionary where the key is the index of a categorical feature in our RDD and the value for the key indicates the number of levels that the categorical feature has. The numTrees specifies the number of trees to be in the forest.
The next parameter tells the model to use all the features in ourdataset instead of keeping only the most descriptive ones, while the lastone specifies the seed for the stochastic part of the model.Let's see how well our model did:RF_results = ( topFeatures_test.map(lambda row: row.label) \ .zip(RF_model \ .predict(topFeatures_test \ .map(lambda row: row.features))) )RF_evaluation = ev.BinaryClassificationMetrics(RF_results)print('Area under PR: {0:.2f}' \ .format(RF_evaluation.areaUnderPR))print('Area under ROC: {0:.2f}' \ .format(RF_evaluation.areaUnderROC))model_evaluation.unpersist()Here are the results:https://www.iteblog.comAs you can see, the Random Forest model with fewer featuresperformed even better than the logistic regression model. Let's see howthe logistic regression would perform with a reduced number of features:LR_Model_2 = LogisticRegressionWithLBFGS \ .train(topFeatures_train, iterations=10)LR_results_2 = ( topFeatures_test.map(lambda row: row.label) \ .zip(LR_Model_2 \ .predict(topFeatures_test \ .map(lambda row: row.features))) ).map(lambda row: (row[0], row[1] * 1.0))LR_evaluation_2 = ev.BinaryClassificationMetrics(LR_results_2)print('Area under PR: {0:.2f}' \ .format(LR_evaluation_2.areaUnderPR))print('Area under ROC: {0:.2f}' \ .format(LR_evaluation_2.areaUnderROC))LR_evaluation_2.unpersist()The results might surprise you:As you can see, both models can be simplified and still attain the samelevel of accuracy. Having said that, you should always opt for a modelwith fewer variables.https://www.iteblog.comSummaryIn this chapter, we looked at the capabilities of the MLlib package ofPySpark. Even though the package is currently in a maintenance modeand is not actively being worked on, it is still good to know how to useit. Also, for now it is the only package available to train models whilestreaming data. We used MLlib to clean up, transform, and get familiarwith the dataset of infant deaths. Using that knowledge we thensuccessfully built two models that aimed at predicting the chance ofinfant survival given the information about its mother, father, and placeof birth.In the next chapter, we will revisit the same problem, but using thenewer package that is currently the Spark recommended package formachine learning.https://www.iteblog.comChapter 6. Introducing the MLPackageIn the previous chapter, we worked with the MLlib package in Sparkthat operated strictly on RDDs. In this chapter, we move to the ML partof Spark that operates strictly on DataFrames. Also, according to theSpark documentation, the primary machine learning API for Spark isnow the DataFrame-based set of models contained in the spark.mlpackage.So, let's get to it!NoteIn this chapter, we will reuse a portion of the dataset we played withinthe previous chapter. The data can be downloaded fromhttp://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gzIn this chapter, you will learn how to do the following:Prepare transformers, estimators, and pipelinesPredict the chances of infant survival using models available in theML packageEvaluate the performance of the modelPerform parameter hyper-tuningUse other machine-learning models available in the packageOverview of the packageAt the top level, the package exposes three main abstract classes: aTransformer, an Estimator, and a Pipeline. We will shortly explaineach with some short examples. 
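As a first taste of what such a stage looks like in code, here is a minimal, hedged sketch of a Transformer at work; it uses the Binarizer from pyspark.ml.feature on a toy DataFrame and is not part of the infant-survival pipeline built later in this chapter:

import pyspark.ml.feature as ft

toy_df = spark.createDataFrame([(0.1,), (0.4,), (0.9,)], ['raw'])
binarizer = ft.Binarizer(
    threshold=0.5, inputCol='raw', outputCol='above_half')

# appends the 'above_half' column with 0.0/1.0 flags
binarizer.transform(toy_df).show()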
We will provide more concreteexamples of some of the models in the last section of this chapter.Transformerhttps://www.iteblog.comhttp://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gzThe Transformer class, like the name suggests, transforms your data by(normally) appending a new column to your DataFrame.At the high level, when deriving from the Transformer abstract class,each and every new Transformer needs to implement a.transform(...) method. The method, as a first and normally the onlyobligatory parameter, requires passing a DataFrame to be transformed.This, of course, varies method-by-method in the ML package: otherpopular parameters are inputCol and outputCol; these, however,frequently default to some predefined values, such as, for example,'features' for the inputCol parameter.There are many Transformers offered in the spark.ml.feature and wewill briefly describe them here (before we use some of them later in thischapter):Binarizer: Given a threshold, the method takes a continuousvariableand transforms it into a binary one.Bucketizer: Similar to the Binarizer, this method takes a list ofthresholds (the splits parameter) and transforms a continuousvariable into a multinomial one.ChiSqSelector: For the categorical target variables (thinkclassification models), this feature allows you to select a predefinednumber of features (parameterized by the numTopFeaturesparameter) that explain the variance in the target the best. Theselection is done, as the name of the method suggests, using a Chi-Square test. It is one of the two-step methods: first, you need to.fit(...) your data (so the method can calculate the Chi-squaretests). Calling the .fit(...) method (you pass your DataFrame as aparameter) returns a ChiSqSelectorModel object that you can thenuse to transform your DataFrame using the .transform(...)method.NoteMore information on Chi-squares can be found here:http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.htmlhttps://www.iteblog.comhttp://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.htmlCountVectorizer: This is useful for a tokenized text (such as[['Learning', 'PySpark', 'with', 'us'],['us', 'us', 'us']]).It is one of two-step methods: first, you need to .fit(...), that is,learn the patterns from your dataset, before you can.transform(...) with the CountVectorizerModel returned by the.fit(...) method. The output from this transformer, for thetokenized text presented previously, would look similar to this: [(4,[0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0]),(4, [3], [3.0])].DCT: The Discrete Cosine Transform takes a vector of real valuesand returns a vector of the same length, but with the sum of cosinefunctions oscillating at different frequencies. Such transformationsare useful to extract some underlying frequencies in your data or indata compression.ElementwiseProduct: A method that returns a vector with elementsthat are products of the vector passed to the method, and a vectorpassed as the scalingVec parameter. For example, if you had a[10.0, 3.0, 15.0] vector and your scalingVec was [0.99, 3.30,0.66], then the vector you would get would look as follows: [9.9,9.9, 9.9].HashingTF: A hashing trick transformer that takes a list of tokenizedtext and returns a vector (of predefined length) with counts. 
FromPySpark's documentation:"Since a simple modulo is used to transform the hash functionto a column index, it is advisable to use a power of two as thenumFeatures parameter; otherwise the features will not bemapped evenly to the columns."IDF: This method computes an Inverse Document Frequency for alist of documents. Note that the documents need to already berepresented as a vector (for example, using either the HashingTF orCountVectorizer).IndexToString: A complement to the StringIndexer method. It usesthe encoding from the StringIndexerModel object to reverse thestring index to original values. As an aside, please note that thissometimes does not work and you need to specify the values fromhttps://www.iteblog.comthe StringIndexer.MaxAbsScaler: Rescales the data to be within the [-1.0, 1.0] range(thus, it does not shift the center of the data).MinMaxScaler: This is similar to the MaxAbsScaler with thedifference that it scales the data to be in the [0.0, 1.0] range.NGram: This method takes a list of tokenized text and returns n-grams: pairs, triples, or n-mores of subsequent words. For example,if you had a ['good', 'morning', 'Robin', 'Williams'] vectoryou would get the following output: ['good morning', 'morningRobin', 'Robin Williams'].Normalizer: This method scales the data to be of unit norm using thep-norm value (by default, it is L2).OneHotEncoder: This method encodes a categorical column to acolumn of binary vectors.PCA: Performs the data reduction using principal componentanalysis.PolynomialExpansion: Performs a polynomial expansion of a vector.For example, if you had a vector symbolically written as [x, y, z],the method would produce the following expansion: [x, x*x, y,x*y, y*y, z, x*z, y*z, z*z].QuantileDiscretizer: Similar to the Bucketizer method, butinstead of passing the splits parameter, you pass the numBuckets one.The method then decides, by calculating approximate quantiles overyour data, what the splits should be.RegexTokenizer: This is a string tokenizer using regular expressions.RFormula: For those of you who are avid R users, you can pass aformula such as vec ~ alpha * 3 + beta (assuming your DataFramehas the alpha and beta columns) and it will produce the vec columngiven the expression.SQLTransformer: Similar to the previous, but instead of R-likeformulas, you can use SQL syntax.TipThe FROM statement should be selecting from __THIS__, indicatingyou are accessing the DataFrame. For example: SELECT alpha * 3+ beta AS vec FROM __THIS__.https://www.iteblog.comStandardScaler: Standardizes the column to have a 0 mean andstandard deviation equal to 1.StopWordsRemover: Removes stop words (such as 'the' or 'a') froma tokenized text.StringIndexer: Given a list of all the words in a column, this willproduce a vector of indices.Tokenizer: This is the default tokenizer that converts the string tolower case and then splits on space(s).VectorAssembler: This is a highly useful transformer that collatesmultiple numeric (vectors included) columns into a single columnwith a vector representation. For example, if you had three columnsin your DataFrame:df = spark.createDataFrame( [(12, 10, 3), (1, 4, 2)], ['a', 'b', 'c']) The output of calling:ft.VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features')\ .transform(df) \ .select('features')\ .collect() It would look as follows:[Row(features=DenseVector([12.0, 10.0, 3.0])), Row(features=DenseVector([1.0, 4.0, 2.0]))]VectorIndexer: This is a method for indexing categorical columnsinto a vector of indices. 
It works in a column-by-column fashion,selecting distinct values from the column, sorting and returning anindex of the value from the map instead of the original value.VectorSlicer: Works on a feature vector, either dense or sparse:given a list of indices, it extracts the values from the feature vector.Word2Vec: This method takes a sentence (string) as an input andtransforms it into a map of {string, vector} format, arepresentation that is useful in natural language processing.Notehttps://www.iteblog.comNote that there are many methods in the ML package that have an Eletter next to it; this means the method is currently in beta (orExperimental) and it sometimes might fail or produce erroneousresults. Beware.EstimatorsEstimators can be thought of as statistical models that need to beestimated to make predictions or classify your observations.If deriving from the abstract Estimator class, the new model has toimplement the .fit(...) method that fits the model given the datafound in a DataFrame and some default or user-specified parameters.There are a lot of estimators available in PySpark and we will nowshortly describe the models available in Spark 2.0.ClassificationThe ML package provides a data scientist with seven classificationmodels to choose from. These range from the simplest ones (such aslogistic regression) to more sophisticated ones. We will provide shortdescriptions of each of them in the following section:LogisticRegression: The benchmark model for classification. Thelogistic regression uses a logit function to calculate the probability ofan observation belonging to a particular class. At the time of writing,the PySpark ML supports only binary classification problems.DecisionTreeClassifier: A classifier that builds a decision tree topredict a class for an observation. Specifying the maxDepthparameter limits the depth the tree grows, the minInstancePerNodedetermines the minimum number of observations in the tree noderequired to further split, the maxBins parameterspecifies themaximum number of bins the continuous variables will be split into,and the impurity specifies the metric to measure and calculate theinformation gain from the split.GBTClassifier: A Gradient Boosted Trees model for classification.The model belongs to the family of ensemble models: models thathttps://www.iteblog.comcombine multiple weak predictive models to form a strong one. Atthe moment, the GBTClassifier model supports binary labels, andcontinuous and categorical features.RandomForestClassifier: This model produces multiple decisiontrees (hence the name—forest) and uses the mode output of thosedecision trees to classify observations. The RandomForestClassifiersupports both binary and multinomial labels.NaiveBayes: Based on the Bayes' theorem, this model usesconditional probability theory to classify observations. TheNaiveBayes model in PySpark ML supports both binary andmultinomial labels.MultilayerPerceptronClassifier: A classifier that mimics thenature of a human brain. Deeply rooted in the Artificial NeuralNetworks theory, the model is a black-box, that is, it is not easy tointerpret the internal parameters of the model. The model consists,at a minimum, of three, fully connected layers (a parameter thatneeds to be specified when creating the model object) of artificialneurons: the input layer (that needs to be equal to the number offeatures in your dataset), a number of hidden layers (at least one),and an output layer with the number of neurons equal to the numberof categories in your label. 
All the neurons in the input and hiddenlayers have a sigmoid activation function, whereas the activationfunction of the neurons in the output layer is softmax.OneVsRest: A reduction of a multiclass classification to a binary one.For example, in the case of a multinomial label, the model can trainmultiple binary logistic regression models. For example, if label ==2, the model will build a logistic regression where it will convert thelabel == 2 to 1 (all remaining label values would be set to 0) andthen train a binary model. All the models are then scored and themodel with the highest probability wins.RegressionThere are seven models available for regression tasks in the PySpark MLpackage. As with classification, these range from some basic ones (suchas the obligatory linear regression) to more complex ones:AFTSurvivalRegression: Fits an Accelerated Failure Timehttps://www.iteblog.comregression model. It is a parametric model that assumes that amarginal effect of one of the features accelerates or decelerates alife expectancy (or process failure). It is highly applicable for theprocesses with well-defined stages.DecisionTreeRegressor: Similar to the model for classification withan obvious distinction that the label is continuous instead of binary(or multinomial).GBTRegressor: As with the DecisionTreeRegressor, the difference isthe data type of the label.GeneralizedLinearRegression: A family of linear models withdiffering kernel functions (link functions). In contrast to the linearregression that assumes normality of error terms, the GLM allowsthe label to have different error term distributions: theGeneralizedLinearRegression model from the PySpark MLpackage supports gaussian, binomial, gamma, and poisson familiesof error distributions with a host of different link functions.IsotonicRegression: A type of regression that fits a free-form, non-decreasing line to your data. It is useful to fit the datasets withordered and increasing observations.LinearRegression: The most simple of regression models, it assumesa linear relationship between features and a continuous label, andnormality of error terms.RandomForestRegressor: Similar to either DecisionTreeRegressoror GBTRegressor, the RandomForestRegressor fits a continuous labelinstead of a discrete one.ClusteringClustering is a family of unsupervised models that are used to findunderlying patterns in your data. The PySpark ML package provides thefour most popular models at the moment:BisectingKMeans: A combination of the k-means clustering methodand hierarchical clustering. The algorithm begins with allobservations in a single cluster and iteratively splits the data into kclusters.Notehttps://www.iteblog.comCheck out this website for more information on pseudo-algorithms:http://minethedata.blogspot.com/2012/08/bisecting-k-means.html.KMeans: This is the famous k-mean algorithm that separates data intok clusters, iteratively searching for centroids that minimize the sumof square distances between each observation and the centroid ofthe cluster it belongs to.GaussianMixture: This method uses k Gaussian distributions withunknown parameters to dissect the dataset. 
Using the Expectation-Maximization algorithm, the parameters for the Gaussians are foundby maximizing the log-likelihood function.TipBeware that for datasets with many features this model mightperform poorly due to the curse of dimensionality and numericalissues with Gaussian distributions.LDA: This model is used for topic modeling in natural languageprocessing applications.There is also one recommendation model available in PySpark ML, butwe will refrain from describing it here.PipelineA Pipeline in PySpark ML is a concept of an end-to-endtransformation-estimation process (with distinct stages) that ingestssome raw data (in a DataFrame form), performs the necessary datacarpentry (transformations), and finally estimates a statistical model(estimator).TipA Pipeline can be purely transformative, that is, consisting ofTransformers only.A Pipeline can be thought of as a chain of multiple discrete stages.When a .fit(...) method is executed on a Pipeline object, all thehttps://www.iteblog.comhttp://minethedata.blogspot.com/2012/08/bisecting-k-means.htmlstages are executed in the order they were specified in the stagesparameter; the stages parameter is a list of Transformer and Estimatorobjects. The .fit(...) method of the Pipeline object executes the.transform(...) method for the Transformers and the .fit(...)method for the Estimators.Normally, the output of a preceding stage becomes the input for thefollowing stage: when deriving from either the Transformer orEstimator abstract classes, one needs to implement the.getOutputCol() method that returns the value of the outputColparameter specified when creating an object.https://www.iteblog.comPredicting the chances of infantsurvival with MLIn this section, we will use the portion of the dataset from the previouschapter to present the ideas of PySpark ML.NoteIf you have not yet downloaded the data while reading the previouschapter, it can be accessed here:http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gzIn this section, we will, once again, attempt to predict the chances of thesurvival of an infant.Loading the dataFirst, we load the data with the help of the following code:import pyspark.sql.types as typlabels = [ ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()), ('BIRTH_PLACE', typ.StringType()), ('MOTHER_AGE_YEARS', typ.IntegerType()), ('FATHER_COMBINED_AGE', typ.IntegerType()), ('CIG_BEFORE', typ.IntegerType()), ('CIG_1_TRI', typ.IntegerType()), ('CIG_2_TRI', typ.IntegerType()), ('CIG_3_TRI', typ.IntegerType()), ('MOTHER_HEIGHT_IN', typ.IntegerType()), ('MOTHER_PRE_WEIGHT', typ.IntegerType()), ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()), ('MOTHER_WEIGHT_GAIN', typ.IntegerType()), ('DIABETES_PRE', typ.IntegerType()), ('DIABETES_GEST', typ.IntegerType()), ('HYP_TENS_PRE', typ.IntegerType()), ('HYP_TENS_GEST', typ.IntegerType()), ('PREV_BIRTH_PRETERM', typ.IntegerType())]schema = typ.StructType([https://www.iteblog.comhttp://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz typ.StructField(e[0], e[1], False) for e in labels])births = spark.read.csv('births_transformed.csv.gz',header=True, schema=schema)We specify the schema of the DataFrame; our severely limited datasetnow only has 17 columns.Creating transformersBefore we can use the dataset to estimate a model, we need to do sometransformations. 
Since statistical models can only operate on numericdata, we will have to encode the BIRTH_PLACE variable.Before we do any of this, since we will use a number of different featuretransformations later in this chapter, let's import them all:import pyspark.ml.feature as ftTo encode the BIRTH_PLACE column, we will use the OneHotEncodermethod. However, the method cannot accept StringType columns; itcan only deal with numeric types so first we will cast the column to anIntegerType:births = births \ .withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'] \ .cast(typ.IntegerType()))Having done this, we can now create our first Transformer:encoder = ft.OneHotEncoder( inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')Let's now create a single column with all the features collated together.We will use the VectorAssembler method:featuresCreator = ft.VectorAssembler( inputCols=[ col[0] for col https://www.iteblog.com in labels[2:]] + \ [encoder.getOutputCol()], outputCol='features')The inputCols parameter passed to the VectorAssembler object is a listof all the columns to be combined together to form the outputCol—the'features'. Note that we use the output of the encoder object (bycalling the .getOutputCol() method), so we do not have to remember tochange this parameter's value should we change the name of the outputcolumn in the encoder object at any point.It's now time to create our first estimator.Creating an estimatorIn this example, we will (once again) use the logistic regression model.However, later in the chapter, we will showcase some more complexmodels from the .classification set of PySpark ML models, so weload the whole section:import pyspark.ml.classification as clOnce loaded, let's create the model by using the following code:logistic = cl.LogisticRegression( maxIter=10, regParam=0.01, labelCol='INFANT_ALIVE_AT_REPORT')We would not have to specify the labelCol parameter if our targetcolumn had the name 'label'. Also, if the output of ourfeaturesCreator was not called 'features', we would have to specifythe featuresCol by (most conveniently) calling the getOutputCol()method on the featuresCreator object.Creating a pipelineAll that is left now is to create a Pipeline and fit the model. First, let'sload the Pipeline from the ML package:https://www.iteblog.comfrom pyspark.ml import PipelineCreating a Pipeline is really easy. Here's how our pipeline should looklike conceptually:Converting this structure into a Pipeline is a walk in the park:pipeline = Pipeline(stages=[ encoder, featuresCreator, logistic ])That's it! Our pipeline is now created so we can (finally!) estimate themodel.Fitting the modelBefore you fit the model, we need to split our dataset into training andtesting datasets. Conveniently, the DataFrame API has the.randomSplit(...) method:births_train, births_test = births \ .randomSplit([0.7, 0.3], seed=666)The first parameter is a list of dataset proportions that should end up in,respectively, births_train and births_test subsets. 
The seedparameter provides a seed to the randomizer.NoteYou can also split the dataset into more than two subsets as long as theelements of the list sum up to 1, and you unpack the output into as manysubsets.https://www.iteblog.comFor example, we could split the births dataset into three subsets like this:train, test, val = births.\ randomSplit([0.7, 0.2, 0.1], seed=666)The preceding code would put a random 70% of the births dataset intothe train object, 20% would go to the test, and the val DataFramewould hold the remaining 10%.Now it is about time to finally run our pipeline and estimate our model:model = pipeline.fit(births_train)test_model = model.transform(births_test)The .fit(...) method of the pipeline object takes our training datasetas an input. Under the hood, the births_train dataset is passed first tothe encoder object. The DataFrame that is created at the encoder stagethen gets passed to the featuresCreator that creates the 'features'column. Finally, the output from this stage is passed to the logisticobject that estimates the final model.The .fit(...) method returns the PipelineModel object (the modelobject in the preceding snippet) that can then be used for prediction; weattain this by calling the .transform(...) method and passing thetesting dataset created earlier. Here's what the test_model looks like inthe following command:test_model.take(1)It generates the following output:https://www.iteblog.comAs you can see, we get all the columns from the Transfomers andEstimators. The logistic regression model outputs several columns: therawPrediction is the value of the linear combination of features and theβ coefficients, the probability is the calculated probability for each ofthe classes, and finally, the prediction is our final class assignment.Evaluating the performance of the modelObviously, we would like to now test how well our model did. PySparkexposes a number of evaluation methods for classification andregression in the .evaluation section of the package:import pyspark.ml.evaluation as evWe will use the BinaryClassficationEvaluator to test how well ourmodel performed:evaluator = ev.BinaryClassificationEvaluator( rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT')The rawPredictionCol can either be the rawPrediction columnproduced by the estimator or the probability.Let's see how well our model performed:print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'}))print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'}))The preceding code produces the following result:The area under the ROC of 74% and area under PR of 71% shows awell-defined model, but nothing out of extraordinary; if we had otherhttps://www.iteblog.comfeatures, we could drive this up, but this is not the purpose of thischapter (nor the book, for that matter).Saving the modelPySpark allows you to save the Pipeline definition for later use. It notonly saves the pipeline structure, but also all the definitions of all theTransformers and Estimators:pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'pipeline.write().overwrite().save(pipelinePath)So, you can load it up later and use it straight away to .fit(...) 
andpredict:loadedPipeline = Pipeline.load(pipelinePath)loadedPipeline \ .fit(births_train)\ .transform(births_test)\ .take(1)The preceding code produces the same result (as expected):If you, however, want to save the estimated model, you can also do that;instead of saving the Pipeline, you need to save the PipelineModel.TipNote, that not only the PipelineModel can be saved: virtually all themodels that are returned by calling the .fit(...) method on anEstimator or Transformer can be saved and loaded back to be reused.https://www.iteblog.comTo save your model, see the following the example:from pyspark.ml import PipelineModelmodelPath = './infant_oneHotEncoder_Logistic_PipelineModel'model.write().overwrite().save(modelPath)loadedPipelineModel = PipelineModel.load(modelPath)test_reloadedModel = loadedPipelineModel.transform(births_test)The preceding script uses the .load(...) method, a class method of thePipelineModel class, to reload the estimated model. You can comparethe result of test_reloadedModel.take(1) with the output oftest_model.take(1) we presented earlier.https://www.iteblog.comParameter hyper-tuningRarely, our first model would be the best we can do. By simply lookingat our metrics and accepting themodel because it passed our pre-conceived performance thresholds is hardly a scientific method forfinding the best model.A concept of parameter hyper-tuning is to find the best parameters ofthe model: for example, the maximum number of iterations needed toproperly estimate the logistic regression model or maximum depth of adecision tree.In this section, we will explore two concepts that allow us to find thebest parameters for our models: grid search and train-validation splitting.Grid searchGrid search is an exhaustive algorithm that loops through the list ofdefined parameter values, estimates separate models, and chooses thebest one given some evaluation metric.A note of caution should be stated here: if you define too manyparameters you want to optimize over, or too many values of theseparameters, it might take a lot of time to select the best model as thenumber of models to estimate would grow very quickly as the number ofparameters and parameter values grow.For example, if you want to fine-tune two parameters with twoparameter values, you would have to fit four models. Adding one moreparameter with two values would require estimating eight models,whereas adding one more additional value to our two parameters(bringing it to three values for each) would require estimating ninemodels. As you can see, this can quickly get out of hand if you are notcareful. See the following chart to inspect this visually:https://www.iteblog.comAfter this cautionary tale, let's get to fine-tuning our parameters space.First, we load the .tuning part of the package:import pyspark.ml.tuning as tuneNext, let's specify our model and the list of parameters we want to loopthrough:logistic = cl.LogisticRegression( labelCol='INFANT_ALIVE_AT_REPORT')grid = tune.ParamGridBuilder() \ .addGrid(logistic.maxIter, [2, 10, 50]) \ .addGrid(logistic.regParam, [0.01, 0.05, 0.3]) \ .build()https://www.iteblog.comFirst, we specify the model we want to optimize the parameters of. Next,we decide which parameters we will be optimizing, and what values forthose parameters to test. We use the ParamGridBuilder() object fromthe .tuning subpackage, and keep adding the parameters to the gridwith the .addGrid(...) 
method: the first parameter is the parameterobject of the model we want to optimize (in our case, these arelogistic.maxIter and logistic.regParam), and the second parameter isa list of values we want to loop through. Calling the .build() method onthe .ParamGridBuilder builds the grid.Next, we need some way of comparing the models:evaluator = ev.BinaryClassificationEvaluator( rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT')So, once again, we'll use the BinaryClassificationEvaluator. It is timenow to create the logic that will do the validation work for us:cv = tune.CrossValidator( estimator=logistic, estimatorParamMaps=grid, evaluator=evaluator)The CrossValidator needs the estimator, the estimatorParamMaps, andthe evaluator to do its job. The model loops through the grid of values,estimates the models, and compares their performance using theevaluator.We cannot use the data straight away (as the births_train andbirths_test still have the BIRTHS_PLACE column not encoded) so wecreate a purely transforming Pipeline:pipeline = Pipeline(stages=[encoder ,featuresCreator])data_transformer = pipeline.fit(births_train)Having done this, we are ready to find the optimal combination ofparameters for our model:cvModel = cv.fit(data_transformer.transform(births_train))https://www.iteblog.comThe cvModel will return the best model estimated. We can now use it tosee if it performed better than our previous model:data_train = data_transformer \ .transform(births_test)results = cvModel.transform(data_train)print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))The preceding code will produce the following result:As you can see, we got a slightly better result. What parameters does thebest model have? The answer is a little bit convoluted, but here's howyou can extract it:results = [ ( [ {key.name: paramValue} for key, paramValue in zip( params.keys(), params.values()) ], metric ) for params, metric in zip( cvModel.getEstimatorParamMaps(), cvModel.avgMetrics )]sorted(results, key=lambda el: el[1], reverse=True)[0]The preceding code produces the following output:https://www.iteblog.comTrain-validation splittingThe TrainValidationSplit model, to select the best model, performs arandom split of the input dataset (the training dataset) into two subsets:smaller training and validation subsets. The split is only performed once.In this example, we will also use the ChiSqSelector to select only thetop five features, thus limiting the complexity of our model:selector = ft.ChiSqSelector( numTopFeatures=5, featuresCol=featuresCreator.getOutputCol(), outputCol='selectedFeatures', labelCol='INFANT_ALIVE_AT_REPORT')The numTopFeatures specifies the number of features to return. 
We willput the selector after the featuresCreator, so we call the.getOutputCol() on the featuresCreator.We covered creating the LogisticRegression and Pipeline earlier, sowe will not explain how these are created again here:logistic = cl.LogisticRegression( labelCol='INFANT_ALIVE_AT_REPORT', featuresCol='selectedFeatures')pipeline = Pipeline(stages=[encoder, featuresCreator, selector])data_transformer = pipeline.fit(births_train)The TrainValidationSplit object gets created in the same fashion asthe CrossValidator model:tvs = tune.TrainValidationSplit( estimator=logistic, estimatorParamMaps=grid, https://www.iteblog.com evaluator=evaluator)As before, we fit our data to the model, and calculate the results:tvsModel = tvs.fit( data_transformer \ .transform(births_train))data_train = data_transformer \ .transform(births_test)results = tvsModel.transform(data_train)print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))The preceding code prints out the following output:Well, the model with less features certainly performed worse than thefull model, but the difference was not that great. Ultimately, it is aperformance trade-off between a more complex model and the lesssophisticated one.https://www.iteblog.comOther features of PySpark ML inactionAt the beginning of this chapter, we described most of the features ofthe PySpark ML library. In this section, we will provide examples ofhow to use some of the Transformers and Estimators.Feature extractionWe have used quite a few models from this submodule of PySpark. Inthis section, we'll show you how to use the most useful ones (in ouropinion).NLP - related feature extractorsAs described earlier, the NGram model takes a list of tokenized text andproduces pairs (or n-grams) of words.In this example, we will take an excerpt from PySpark's documentationand present how to clean up the text before passing it to the NGrammodel. Here's how our dataset looks like (abbreviated for brevity):TipFor the full view of how the following snippet looks like, pleasedownload the code from our GitHub repository:https://github.com/drabastomek/learningPySpark.We copied these four paragraphs from the description of the DataFrameusage in Pipelines: http://spark.apache.org/docs/latest/ml-pipeline.html#dataframe.text_data = spark.createDataFrame([ ['''Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adoptsthe DataFrame from Spark SQL in order to support a variety of data types.'''],https://www.iteblog.comhttps://github.com/drabastomek/learningPySparkhttp://spark.apache.org/docs/latest/ml-pipeline.html#dataframe (...) ['''Columns in a DataFrame are named. The code examples below use names such as "text," "features," and "label."''']], ['input'])Each row in our single-column DataFrame is just a bunch of text. First,we need to tokenize this text. To do so we will use the RegexTokenizerinstead of just the Tokenizer as we can specify the pattern(s) we wantthe text to be broken at:tokenizer = ft.RegexTokenizer( inputCol='input', outputCol='input_arr', pattern='\s+|[,.\"]')The pattern here splits the text on any number of spaces, but alsoremoves commas, full stops, backslashes, and quotation marks. 
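If you want to see this for yourself, a minimal sketch that applies the tokenizer to the text_data DataFrame defined earlier and pulls out one row is shown here (the tokenized name is used only for this sketch):

# Transform the text_data DataFrame and inspect the tokenized column
tokenized = tokenizer.transform(text_data)
tokenized.select('input_arr').take(1)

Nothing here needs to be fitted; the RegexTokenizer is a pure Transformer, so .transform(...) can be called on it directly.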
A singlerow from the output of the tokenizer looks similar to this:As you can see, the RegexTokenizer not only splits the sentences in towords, but also normalizes the text so each word is in small-caps.However, there is still plenty of junk in our text: words such as be, a, orto normally provide us with nothing useful when analyzing a text. Thus,we will remove these so called stopwords using nothing else other thanthe StopWordsRemover(...):stopwords = ft.StopWordsRemover( inputCol=tokenizer.getOutputCol(), outputCol='input_stop')The output of the method looks as follows:https://www.iteblog.comNow we only have the useful words. So, let's build our NGram model andthe Pipeline:ngram = ft.NGram(n=2, inputCol=stopwords.getOutputCol(), outputCol="nGrams")pipeline = Pipeline(stages=[tokenizer, stopwords, ngram])Now that we have the pipeline, we follow in a very similar fashion asbefore:data_ngram = pipeline \ .fit(text_data) \ .transform(text_data)data_ngram.select('nGrams').take(1)The preceding code produces the following output:That's it. We have got our n-grams and we can now use them in furtherNLP processing.Discretizing continuous variablesEver so often, we deal with a continuous feature that is highly non-linearand really hard to fit in our model with only one coefficient.https://www.iteblog.comIn such a situation, it might be hard to explain the relationship betweensuch a feature and the target with just one coefficient. Sometimes, it isuseful to band the values into discrete buckets.First, let's create some fake data with the help of the following code:import numpy as npx = np.arange(0, 100)x = x / 100.0 * np.pi * 4y = x * np.sin(x / 1.764) + 20.1234Now, we can create a DataFrame by using the following code:schema = typ.StructType([ typ.StructField('continuous_var', typ.DoubleType(), False )])data = spark.createDataFrame( [[float(e), ] for e in y], schema=schema)https://www.iteblog.comNext, we will use the QuantileDiscretizer model to split ourcontinuous variable into five buckets (the numBuckets parameter):discretizer = ft.QuantileDiscretizer( numBuckets=5, inputCol='continuous_var', outputCol='discretized')Let's see what we have got:data_discretized = discretizer.fit(data).transform(data)Our function now looks as follows:https://www.iteblog.comWe can now treat this variable as categorical and use the OneHotEncoderto encode it for future use.Standardizing continuous variablesStandardizing continuous variables helps not only in betterunderstanding the relationships between the features (as interpreting thecoefficients becomes easier), but it also aids computational efficiencyand protects from running into some numerical traps. Here's how you doit with PySpark ML.First, we need to create a vector representation of our continuousvariable (as it is only a single float):vectorizer = ft.VectorAssembler(https://www.iteblog.com inputCols=['continuous_var'], outputCol= 'continuous_vec')Next, we build our normalizer and the pipeline. 
By setting thewithMean and withStd to True, the method will remove the mean andscale the variance to be of unit length:normalizer = ft.StandardScaler( inputCol=vectorizer.getOutputCol(), outputCol='normalized', withMean=True, withStd=True)pipeline = Pipeline(stages=[vectorizer, normalizer])data_standardized = pipeline.fit(data).transform(data)Here's what the transformed data would look like:https://www.iteblog.comAs you can see, the data now oscillates around 0 with the unit variance(the green line).ClassificationSo far we have only used the LogisticRegression model from PySparkML. In this section, we will use the RandomForestClassfier to, onceagain, model the chances of survival for an infant.Before we can do that, though, we need to cast the label feature toDoubleType:import pyspark.sql.functions as funcbirths = births.withColumn( 'INFANT_ALIVE_AT_REPORT', func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType()))births_train, births_test = births \ .randomSplit([0.7, 0.3], seed=666)Now that we have the label converted to double, we are ready to buildour model. We progress in a similar fashion as before with the distinctionthat we will reuse the encoder and featureCreator from earlier in thechapter. The numTrees parameter specifies how many decision treesshould be in our random forest, and the maxDepth parameter limits thedepth of the trees:classifier = cl.RandomForestClassifier( numTrees=5, maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')pipeline = Pipeline( stages=[ encoder, featuresCreator, classifier])model = pipeline.fit(births_train)test = model.transform(births_test)Let's now see how the RandomForestClassifier model performscompared to the LogisticRegression:https://www.iteblog.comevaluator = ev.BinaryClassificationEvaluator( labelCol='INFANT_ALIVE_AT_REPORT')print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderROC"}))print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderPR"}))We get the following results:Well, as you can see, the results are better than the logistic regressionmodel by roughly 3 percentage points. Let's test how well would amodel with one tree do:classifier = cl.DecisionTreeClassifier( maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')pipeline = Pipeline(stages=[ encoder, featuresCreator, classifier])model = pipeline.fit(births_train)test = model.transform(births_test)evaluator = ev.BinaryClassificationEvaluator( labelCol='INFANT_ALIVE_AT_REPORT')print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderROC"}))print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderPR"}))The preceding code gives us the following:https://www.iteblog.comNot bad at all! It actually performed better than the random forest modelin terms of the precision-recall relationship and only slightly worse interms of the area under the ROC. 
We just might have found a winner!ClusteringClustering is another big part of machine learning: quite often, in the realworld, we do not have the luxury of having the target feature, so weneed to revert to an unsupervised learning paradigm, where we try touncover patterns in the data.Finding clusters in the births datasetIn this example, we will use the k-means model to find similarities in thebirths data:import pyspark.ml.clustering as cluskmeans = clus.KMeans(k = 5, featuresCol='features')pipeline = Pipeline(stages=[ assembler, featuresCreator, kmeans])model = pipeline.fit(births_train)Having estimated the model, let's see if we can find some differencesbetween clusters:test = model.transform(births_test)test \ .groupBy('prediction') \ .agg({ '*': 'count', 'MOTHER_HEIGHT_IN': 'avg' }).collect()The preceding code produces the following output:https://www.iteblog.comWell, the MOTHER_HEIGHT_IN is significantly different in cluster 2. Goingthrough the results (which we will not do here for obvious reasons)would most likely uncover more differences and allow us to understandthe data better.Topic miningClustering models are not limited to numeric data only. In the field ofNLP, problems such as topic extraction rely on clustering to detectdocuments with similar topics. We will go through such an example.First, let's create our dataset. The data is formed from randomly selectedparagraphs found on the Internet: three of them deal with topics ofnature and national parks, the remaining three cover technology.TipThe code snippet is abbreviated again, for obvious reasons. Refer to thesource file on GitHub for full representation.text_data = spark.createDataFrame([ ['''To make a computer do anything, you have to write a computer program. To write a computer program, you have to tell the computer, step by step, exactly what you want it to do. The computer then "executes" the program, following each step mechanically, to accomplish the end goal. When you are telling the computer what to do, you also get to choose how it's going to do it. That's where computer algorithms come in. The algorithm is the basic technique used to get the job done. Let's follow an https://www.iteblog.com example to help get an understanding of the algorithm concept.'''], (...), ['''Australia has over 500 national parks. Over 28 million hectares of land is designated as national parkland, accounting for almost four per cent of Australia's land areas. In addition, a further six per cent of Australia is protected and includes state forests, nature parks and conservation reserves.National parks are usually large areas of land that are protected because they have unspoilt landscapes and a diverse number of native plants and animals. This means that commercial activities such as farming are prohibited and human activity is strictly monitored.''']], ['documents'])First, we will once again use the RegexTokenizer and theStopWordsRemover models:tokenizer = ft.RegexTokenizer( inputCol='documents', outputCol='input_arr', pattern='\s+|[,.\"]')stopwords = ft.StopWordsRemover( inputCol=tokenizer.getOutputCol(), outputCol='input_stop')Next in our pipeline is the CountVectorizer: a model that counts wordsin a document and returns a vector of counts. 
The length of the vector isequal to the total number of distinct words in all the documents, whichcan be seen in the following snippet:stringIndexer = ft.CountVectorizer( inputCol=stopwords.getOutputCol(), outputCol="input_indexed")tokenized = stopwords \ .transform( tokenizer\ .transform(text_data) ) stringIndexer \ .fit(tokenized)\ .transform(tokenized)\ .select('input_indexed')\https://www.iteblog.com .take(2)The preceding code will produce the following output:As you can see, there are 262 distinct words in the text, and eachdocument is now represented by a count of each word occurrence.It's now time to start predicting the topics. For that purpose we will usethe LDA model—the Latent Dirichlet Allocation model:clustering = clus.LDA(k=2, optimizer='online', featuresCol=stringIndexer.getOutputCol())The k parameter specifies how many topics we expect to see, theoptimizer parameter can be either 'online' or 'em' (the latter standingfor the Expectation Maximization algorithm).Putting these puzzles together results in, so far, the longest of ourpipelines:pipeline = ml.Pipeline(stages=[ tokenizer, stopwords, stringIndexer, clustering])https://www.iteblog.comHave we properly uncovered the topics? Well, let's see:topics = pipeline \ .fit(text_data) \ .transform(text_data)topics.select('topicDistribution').collect()Here's what we get:Looks like our method discovered all the topics properly! Do not getused to seeing such good results though: sadly, real world data is seldomthat kind.RegressionWe could not finish a chapter on a machine learning library withoutbuilding a regression model.In this section, we will try to predict the MOTHER_WEIGHT_GAIN givensome of the features described here; these are contained in the featureslisted here:features = ['MOTHER_AGE_YEARS','MOTHER_HEIGHT_IN', 'MOTHER_PRE_WEIGHT','DIABETES_PRE', 'DIABETES_GEST','HYP_TENS_PRE', 'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM', 'CIG_BEFORE','CIG_1_TRI', 'CIG_2_TRI', 'CIG_3_TRI' ]First, since all the features are numeric, we will collate them togetherand use the ChiSqSelector to select only the top six most importanthttps://www.iteblog.comfeatures:featuresCreator = ft.VectorAssembler( inputCols=[col for col in features[1:]], outputCol='features')selector = ft.ChiSqSelector( numTopFeatures=6, outputCol="selectedFeatures", labelCol='MOTHER_WEIGHT_GAIN')In order to predict the weight gain, we will use the gradient boostedtrees regressor:import pyspark.ml.regression as regregressor = reg.GBTRegressor( maxIter=15, maxDepth=3, labelCol='MOTHER_WEIGHT_GAIN')Finally, again, we put it all together into a Pipeline:pipeline = Pipeline(stages=[ featuresCreator, selector, regressor])weightGain = pipeline.fit(births_train)Having created the weightGain model, let's see if it performs well on ourtesting data:evaluator = ev.RegressionEvaluator( predictionCol="prediction", labelCol='MOTHER_WEIGHT_GAIN')print(evaluator.evaluate( weightGain.transform(births_test), {evaluator.metricName: 'r2'}))We get the following output:https://www.iteblog.comSadly, the model is no better than a flip of a coin. It looks that withoutadditional independent features that are better correlated with theMOTHER_WEIGHT_GAIN label, we will not be able to explain its variancesufficiently.https://www.iteblog.comSummaryIn this chapter, we went into details of how to use PySpark ML: theofficial main machine learning library for PySpark. We explained whatthe Transformer and Estimator are, and showed their role in anotherconcept introduced in the ML library: the Pipeline. 
Subsequently, wealso presented how to use some of the methods to fine-tune the hyperparameters of models. Finally, we gave some examples of how to usesome of the feature extractors and models from the library.In the next chapter, we will delve into graph theory and GraphFramesthat help in tackling machine learning problems better represented asgraphs.https://www.iteblog.comChapter 7. GraphFramesGraphs are an interesting way to solve data problems because graphstructures are a more intuitive approach to many classes of dataproblems.In this chapter, you will learn about:Why use graphs?Understanding the classic graph problem: the flights datasetUnderstanding the graph vertices and edgesSimple queriesUsing motif findingUsing breadth first searchUsing PageRankVisualizing flights using D3Whether traversing social networks or restaurant recommendations, it iseasier to understand these data problems within the context of graphstructures: vertices, edges, and properties:https://www.iteblog.comFor example, within the context of social networks, the vertices are thepeople while the edges are the connections between them. Within thecontext of restaurant recommendations, the vertices (for example)involve the location, cuisine type, and restaurants while the edges arethe connections between them (for example, these three restaurants arein Vancouver, BC, but only two of them serve ramen).While the two graphs are seemingly disconnected, you can in fact createa social network + restaurant recommendation graph based on thereviews of friendswithin a social circle, as noted in the following figure:For example, if Isabella wants to find a great ramen restaurant inVancouver, traversing her friends' reviews, she will most likely chooseKintaro Ramen, as both Samantha and Juliette have rated therestaurant favorably:https://www.iteblog.comAnother classic graph problem is the analysis of flight data: airports arerepresented by vertices and flights between those airports arerepresented by edges. Also, there are numerous properties associatedwith these flights, including, but not limited to, departure delays, planetype, and carrier:https://www.iteblog.comIn this chapter, we will use GraphFrames to quickly and easily analyzeflight performance data organized in graph structures. Because we'reusing graph structures, we can easily ask many questions that are not asintuitive as tabular structures, such as finding structural motifs, airportranking using PageRank, and shortest paths between cities.GraphFrames leverages the distribution and expression capabilities ofthe DataFrame API to both simplify your queries and leverage theperformance optimizations of the Apache Spark SQL engine.In addition, with GraphFrames, graph analysis is available in Python,Scala, and Java. Just as important, you can leverage your existingApache Spark skills to solve graph problems (in addition to machinelearning, streaming, and SQL) instead of making a paradigm shift tolearn a new framework.Introducing GraphFrameshttps://www.iteblog.comGraphFrames utilizes the power of Apache Spark DataFrames to supportgeneral graph processing. Specifically, the vertices and edges arerepresented by DataFrames allowing us to store arbitrary data with eachvertex and edge. While GraphFrames is similar to Spark's GraphXlibrary, there are some key differences, including:GraphFrames leverage the performance optimizations and simplicityof the DataFrame API.By using the DataFrame API, GraphFrames now have Python, Java,and Scala APIs. 
GraphX is only accessible through Scala; now all its algorithms are available in Python and Java. Note, at the time of writing, there was a bug preventing GraphFrames from working with Python 3.x, hence we will be using Python 2.x.
At the time of writing, GraphFrames is on version 0.3 and available as a Spark package (http://spark-packages.org) at https://spark-packages.org/package/graphframes/graphframes.

Tip
For more information about GraphFrames, please refer to Introducing GraphFrames at https://databricks.com/blog/2016/03/03/introducing-graphframes.html.

Installing GraphFrames
If you are running your job from a Spark CLI (for example, spark-shell, pyspark, spark-sql, spark-submit), you can use the --packages command, which will extract, compile, and execute the necessary code for you to use the GraphFrames package.
For example, to use the latest GraphFrames package (version 0.3) with Spark 2.0 and Scala 2.11 with spark-shell, the command is:

> $SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11

If you are using a notebook service, you may need to install the package first. For example, the following section shows the steps to install the GraphFrames library within the free Databricks Community Edition (http://databricks.com/try-databricks).

Creating a library
Within Databricks, you can create a library that is comprised of a Scala/Java JAR, Python Egg, or Maven Coordinate (including the Spark package).
To start, go to your Workspace within Databricks, right-click the folder you want to create the library in (in this case, flights), click Create, and then click Library:
In the Create Library dialog, choose Maven Coordinate from the Source dropdown, as noted in the following diagram:

Tip
Maven is a tool that is used to build and manage Java-based projects such as the GraphFrames project. Maven coordinates uniquely identify those projects (or dependencies or plug-ins) so you can quickly find the project within a Maven repository; for example, https://mvnrepository.com/artifact/graphframes/graphframes.

From here, you can click the Search Spark Packages and Maven Central button and search for the GraphFrames package. Ensure that you match the GraphFrames version of Spark (for example, Spark 2.0) and Scala (for example, Scala 2.11) with your Spark cluster.
You can also enter the Maven coordinate for the GraphFrames Spark package if you already know it. For Spark 2.0 and Scala 2.11, enter the following coordinate:

graphframes:graphframes:0.3.0-spark2.0-s_2.11

Once entered, click on Create Library, as noted in the following screenshot:
Note that this is a one-time installation task for the GraphFrames Spark package (as part of a library). Once it is installed, you can automatically attach the package by default to any Databricks cluster that you create:

Preparing your flights dataset
For this flights sample scenario, we will make use of two sets of data:
Airline On-Time Performance and Causes of Flight Delays: [http://bit.ly/2ccJPPM] This dataset contains scheduled and actual departure and arrival times, and delay causes as reported by US air carriers.
The data is collected by the Office of Airline Information,Bureau of Transportation Statistics (BTS).Open Flights: Airports and airline data:[http://openflights.org/data.html] This dataset contains the list of USairport data including the IATA code, airport name, and airportlocation.We will create two DataFrames – airports and departureDelays–whichwill make up our vertices and edges of our GraphFrame, respectively.We will be creating this flights sample application using Python.As we are using a Databricks notebook for our example, we can makeuse of the /databricks-datasets/location, which contains numeroussample datasets. You can also download the data from:departureDelays.csv: http://bit.ly/2ejPr8kairportCodes: http://bit.ly/2ePAdKTIn this example, we are creating two variables denoting the file paths forour Airports and Departure Delays data, respectively. Then we will loadthese datasets and create the respective Spark DataFrames; note forboth of these files, we can easily infer the schema:# Set File PathstripdelaysFilePath = "/databricks-datasets/flights/departuredelays.csv"airportsnaFilePath = "/databricks-datasets/flights/airport-codes-na.txt"# Obtain airports dataset# Note, this dataset is tab-delimited with a headerairportsna = spark.read.csv(airportsnaFilePath, header='true', https://www.iteblog.comhttp://bit.ly/2ccJPPMhttp://openflights.org/data.htmlhttp://bit.ly/2ejPr8khttp://bit.ly/2ePAdKTinferSchema='true', sep='\t')airportsna.createOrReplaceTempView("airports_na")# Obtain departure Delays data# Note, this dataset is comma-delimited with a headerdepartureDelays = spark.read.csv(tripdelaysFilePath, header='true')departureDelays.createOrReplaceTempView("departureDelays")departureDelays.cache()Once we loaded the departureDelays DataFrame, we also cache it sowe can include some additional filtering of the data in a performantmanner:# Available IATA codes from the departuredelays sample datasettripIATA = spark.sql("select distinct iata from (select distinct origin as iata from departureDelays union all select distinct destination as iata from departureDelays) a")tripIATA.createOrReplaceTempView("tripIATA")The preceding query allows us to build a distinct list with origin cityIATA codes (for example, Seattle = 'SEA', San Francisco = 'SFO',New York JFK = 'JFK', and so on). Next, we only include airports thathad a trip occur within the departureDelaysDataFrame:# Only include airports with atleast one trip from the # `departureDelays` datasetairports = spark.sql("select f.IATA, f.City, f.State, f.Country from airports_na f join tripIATA t on t.IATA = f.IATA")airports.createOrReplaceTempView("airports")airports.cache()By building the distinct list of origin airport codes, we can build theairports DataFrame to contain only the airport codes that exist in thedepartureDelays dataset. 
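The same filtering can also be expressed without SQL strings. Here is a minimal sketch using the DataFrame API directly, assuming only the airportsna and departureDelays DataFrames loaded above (the tripIATA_df and airports_df names are used purely for this sketch):

# Build the distinct list of IATA codes seen in the trips data
# and keep only the matching airports (sketch-only names)
from pyspark.sql import functions as F

tripIATA_df = (departureDelays
    .select(F.col("origin").alias("IATA"))
    .union(departureDelays.select(F.col("destination").alias("IATA")))
    .distinct())

airports_df = airportsna.join(tripIATA_df, on="IATA", how="inner")

Whether you prefer the SQL or the DataFrame form is largely a matter of taste; both go through the same Catalyst Optimizer.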
The following code snippet generates a newDataFrame (departureDelays_geo) that is comprised of key attributesincluding date of flight, delays, distance, and airport information (origin,destination):# Build `departureDelays_geo` DataFrame# Obtain key attributes such as Date of flight, delays, distance, # and airport information (Origin, Destination) https://www.iteblog.comdepartureDelays_geo = spark.sql("select cast(f.date as int) as tripid, cast(concat(concat(concat(concat(concat(concat('2014-', concat(concat(substr(cast(f.date as string), 1, 2), '-')), substr(cast(f.date as string), 3, 2)), ''), substr(cast(f.date as string), 5, 2)), ':'), substr(cast(f.date as string), 7, 2)), ':00') as timestamp) as `localdate`, cast(f.delay as int), cast(f.distance as int), f.origin as src, f.destination as dst, o.city as city_src, d.city as city_dst, o.state as state_src, d.state as state_dst from departuredelays f join airports o on o.iata = f.origin join airports d on d.iata = f.destination") # Create Temporary View and cachedepartureDelays_geo.createOrReplaceTempView("departureDelays_geo")departureDelays_geo.cache()To take a quick peek into this data, you can run the show method asshown here:# Review the top 10 rows of the `departureDelays_geo` DataFramedepartureDelays_geo.show(10)https://www.iteblog.comBuilding the graphNow that we've imported our data, let's build our graph. To do this, we'regoing to build the structure for our vertices and edges. At the time ofwriting, GraphFrames requires a specific naming convention for verticesand edges:The column representing the vertices needs to have the name ofid.In our case, the vertices of our flight data are the airports.Therefore, we will need to rename the IATA airport code to id inour airports DataFrame.The columns representing the edges need to have a source (src) anddestination (dst). For our flight data, the edges are the flights,therefore the src and dst are the origin and destination columnsfrom the departureDelays_geo DataFrame.To simplify the edges for our graph, we will create the tripEdgesDataFrame with a subset of the columns available within thedepartureDelays_Geo DataFrame. 
As well, we created a tripVertices DataFrame that simply renames the IATA column to id to match the GraphFrame naming convention:

# Note, ensure you have already installed
# the GraphFrames spark-package
from pyspark.sql.functions import *
from graphframes import *

# Create Vertices (airports) and Edges (flights)
tripVertices = airports.withColumnRenamed("IATA", "id").distinct()
tripEdges = departureDelays_geo.select("tripid", "delay", "src", "dst", "city_dst", "state_dst")

# Cache Vertices and Edges
tripEdges.cache()
tripVertices.cache()

Within Databricks, you can query the data using the display command. For example, to view the tripEdges DataFrame, the command is as follows:

display(tripEdges)

The output is as follows:
Now that we have the two DataFrames, we can create a GraphFrame using the GraphFrame command:

tripGraph = GraphFrame(tripVertices, tripEdges)

Executing simple queries
Let's start with a set of simple graph queries to understand flight performance and departure delays.

Determining the number of airports and trips
For example, to determine the number of airports and trips, you can run the following commands:

print "Airports: %d" % tripGraph.vertices.count()
print "Trips: %d" % tripGraph.edges.count()

As you can see from the results, there are 279 airports with 1.36 million trips:

Determining the longest delay in this dataset
To determine the longest delayed flight in the dataset, you can run the following query, with the result of 1,642 minutes (that's more than 27 hours!):

tripGraph.edges.groupBy().max("delay")

# Output
+----------+
|max(delay)|
+----------+
|      1642|
+----------+

Determining the number of delayed versus on-time/early flights
To determine the number of delayed versus on-time (or early) flights, you can run the following queries:

print "On-time / Early Flights: %d" % tripGraph.edges.filter("delay <= 0").count()
print "Delayed Flights: %d" % tripGraph.edges.filter("delay > 0").count()

with the results noting that almost 43% of the flights were delayed!

What flights departing Seattle are most likely to have significant delays?
Digging further into this data, let's find out the top five destinations for flights departing from Seattle that are most likely to have significant delays. This can be achieved through the following query:

tripGraph.edges\
  .filter("src = 'SEA' and delay > 0")\
  .groupBy("src", "dst")\
  .avg("delay")\
  .sort(desc("avg(delay)"))\
  .show(5)

As you can see in the following results, Philadelphia (PHL), Colorado Springs (COS), Fresno (FAT), Long Beach (LGB), and Washington D.C. (IAD) are the top five cities with delayed flights originating from Seattle:

What states tend to have significant delays departing from Seattle?
Let's find which states have the longest cumulative delays (with individual delays > 100 minutes) originating from Seattle. This time we will use the display command to review the data:

# States with the longest cumulative delays (with individual
# delays > 100 minutes) (origin: Seattle)
display(tripGraph.edges.filter("src = 'SEA' and delay > 100"))

Using the Databricks display command, we can also quickly change from this table view to a map view of the data. As can be seen in the following figure, the state with the most cumulative delays originating from Seattle (in this dataset) is California:
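If you are not working inside Databricks, or simply want the numbers behind the map, a minimal sketch that aggregates the same cumulative delays by destination state is shown here; it assumes the tripGraph built above and the state_dst column carried over from departureDelays_geo (seaDelaysByState is a name used only for this sketch):

# Cumulative delays (individual delays > 100 minutes) from Seattle,
# aggregated by destination state (sketch-only name)
seaDelaysByState = tripGraph.edges \
  .filter("src = 'SEA' and delay > 100") \
  .groupBy("state_dst") \
  .sum("delay") \
  .orderBy(desc("sum(delay)"))
seaDelaysByState.show(10)

This is only one way to slice it; the map view in Databricks is driven by the same filtered edges.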
Understanding vertex degrees
Within the context of graph theory, the degrees around a vertex are the number of edges around the vertex. In our flights example, the degrees are then the total number of edges (that is, flights) to the vertex (that is, airports). Therefore, if we were to obtain the top 20 vertex degrees (in descending order) from our graph, then we would be asking for the top 20 busiest airports (most flights in and out) from our graph. This can be quickly determined using the following query:

display(tripGraph.degrees.sort(desc("degree")).limit(20))

Because we're using the display command, we can quickly view a bar graph of this data:
Diving into more details, here are the top 20 inDegrees (that is, incoming flights):

display(tripGraph.inDegrees.sort(desc("inDegree")).limit(20))

While here are the top 20 outDegrees (that is, outgoing flights):

display(tripGraph.outDegrees.sort(desc("outDegree")).limit(20))

Interestingly, while the top 10 airports (Atlanta/ATL to Charlotte/CLT) are ranked the same for incoming and outgoing flights, the ranks of the next 10 airports change (for example, Seattle/SEA is 17th for incoming flights, but 18th for outgoing).

Determining the top transfer airports
An extension of understanding vertex degrees for airports is to determine the top transfer airports. Many airports are used as transfer points instead of being the final destination. An easy way to calculate this is by calculating the ratio of inDegrees (the number of flights to the airport) to outDegrees (the number of flights leaving the airport). Values close to 1 may indicate many transfers, whereas values < 1 indicate many outgoing flights and values > 1 indicate many incoming flights.
Note that this is a simple calculation that does not consider the timing or scheduling of flights, just the overall aggregate number within the dataset:

# Calculate the inDeg (flights into the airport) and
# outDeg (flights leaving the airport)
inDeg = tripGraph.inDegrees
outDeg = tripGraph.outDegrees

# Calculate the degreeRatio (inDeg/outDeg)
degreeRatio = inDeg.join(outDeg, inDeg.id == outDeg.id) \
  .drop(outDeg.id) \
  .selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio") \
  .cache()

# Join back to the 'airports' DataFrame
# (instead of registering temp table as above)
transferAirports = degreeRatio.join(airports, degreeRatio.id == airports.IATA) \
  .selectExpr("id", "city", "degreeRatio") \
  .filter("degreeRatio between 0.9 and 1.1")

# List out the top 10 transfer city airports
display(transferAirports.orderBy("degreeRatio").limit(10))

The output of this query is a bar chart of the top 10 transfer city airports (that is, hub airports):
This makes sense since these airports are major hubs for national airlines (for example, Delta uses Minneapolis and Salt Lake City as its hubs, Frontier uses Denver, American uses Dallas and Phoenix, United uses Houston, Chicago, and San Francisco, and Hawaiian Airlines uses Kahului and Honolulu as its hubs).
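The degreeRatio DataFrame is also handy for looking at the opposite extremes: airports whose traffic is mostly one-directional in this dataset. A minimal sketch, reusing the unfiltered degreeRatio computed above (before the between 0.9 and 1.1 filter is applied), might be:

# Airports that mostly receive traffic (ratio well above 1)
display(degreeRatio.orderBy(desc("degreeRatio")).limit(10))

# Airports that mostly send traffic (ratio well below 1)
display(degreeRatio.orderBy("degreeRatio").limit(10))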
Understanding motifs
To easily understand the complex relationship of city airports and the flights between each other, we can use motifs to find patterns of airports (for example, vertices) connected by flights (that is, edges). The result is a DataFrame in which the column names are given by the motif keys. Note that motif finding is one of the new graph algorithms supported as part of GraphFrames.
For example, let's determine the delays that are due to San Francisco International Airport (SFO); tripGraphPrime is a smaller version of tripGraph built from a subset of the edge columns:

# Generate motifs
motifs = tripGraphPrime.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
  .filter("(b.id = 'SFO') and (ab.delay > 500 or bc.delay > 500) and bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000")

# Display motifs
display(motifs)

Breaking down the filter: (b.id = 'SFO') denotes that the middle vertex (b), the transfer airport, is San Francisco; (ab.delay > 500 or bc.delay > 500) denotes that we are limited to flights that have delays greater than 500 minutes; and (bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000) denotes that the second (bc) flight departs after the first (ab) flight and within roughly the same day (recall that tripid is derived from the flight date and time).
An abridged version of the resulting motifs (the column names come from the motif keys a, ab, b, bc, and c) looks as follows:

a: Tucson (TUS)
ab: TUS -> SFO (-5) [1011126]
b: San Francisco (SFO)
bc: SFO -> JFK (536) [1021507]
c: New York (JFK)

Referring to the TUS > SFO > JFK flight, you will notice that while the flight from Tucson to San Francisco departed 5 minutes early, the flight from San Francisco to New York JFK was delayed by 536 minutes.
By using motif finding, you can easily search for structural patterns in your graph; by using GraphFrames, you are using the power and speed of DataFrames to distribute and perform your query.

Determining airport ranking using PageRank
Because GraphFrames is built on top of GraphX, there are several algorithms that we can immediately leverage. PageRank was popularized by the Google Search Engine and created by Larry Page. To quote Wikipedia:

"PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites."

While the preceding example refers to web pages, this concept readily applies to any graph structure whether it is created from web pages, bike stations, or airports. Yet the interface via GraphFrames is as simple as calling a method. GraphFrames.PageRank will return the PageRank results as a new column appended to the vertices DataFrame to simplify our downstream analysis.
As there are many flights and connections through the various airports included in this dataset, we can use the PageRank algorithm to have Spark traverse the graph iteratively to compute a rough estimate of how important each airport is:

# Determining Airport ranking of importance using 'pageRank'
ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)

# Display the pageRank output
display(ranks.vertices.orderBy(ranks.vertices.pagerank.desc()).limit(20))

Note that resetProbability = 0.15 represents the probability of resetting to a random vertex (this is the default value) while maxIter = 5 is a set number of iterations.

Tip
For more information on PageRank parameters, please refer to Wikipedia > Page Rank at https://en.wikipedia.org/wiki/PageRank.

The results of the PageRank are noted in the following bar graph:
In terms of airport ranking, the PageRank algorithm has determined that ATL (Hartsfield-Jackson Atlanta International Airport) is the most important airport in the United States.
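If you would rather pull the top-ranked airport out of the ranks GraphFrame programmatically than read it off the bar graph, a minimal sketch (with no new assumptions beyond the ranks object created above; topAirport is a sketch-only name) is:

# Grab the single highest-ranked vertex from the PageRank output
topAirport = ranks.vertices \
  .orderBy(ranks.vertices.pagerank.desc()) \
  .first()
print "Top airport: %s, PageRank: %.4f" % (topAirport['id'], topAirport['pagerank'])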
This observation makes sense asATL is not only the busiest airport in the United States(http://bit.ly/2eTGHs4), but it is also the busiest airport in the world(2000-2015) (http://bit.ly/2eTGDsy).https://www.iteblog.comhttps://en.wikipedia.org/wiki/PageRankhttp://bit.ly/2eTGHs4http://bit.ly/2eTGDsyDetermining the most popularnon-stop flightsExpanding upon our tripGraph GraphFrame, the following query willallow us to find the most popular non-stop flights in the US (for thisdataset):# Determine the most popular non-stop flightsimport pyspark.sql.functions as functopTrips = tripGraph \ .edges \ .groupBy("src", "dst") \ .agg(func.count("delay").alias("trips"))# Show the top 20 most popular flights (single city hops)display(topTrips.orderBy(topTrips.trips.desc()).limit(20))Note, while we are using the delay column, we're just actually doing acount of the number of trips. Here's the output:As can be observed from this query, the two most frequent non-stopflights are between LAX (Los Angeles) and SFO (San Francisco). Thehttps://www.iteblog.comfact that these flights are so frequent indicates their importance in theairline market. As noted in the New York Times article from April 4,2016, Alaska Air Sees Virgin America as Key to West Coast(http://nyti.ms/2ea1uZR), acquiring slots at these two airports was oneof the reasons why Alaska Airlines purchased Virgin Airlines. Graphsare not just fun but also contain potentially powerful business insight!https://www.iteblog.comhttp://nyti.ms/2ea1uZRUsing Breadth-First SearchThe Breadth-first search (BFS) is a new algorithm as part ofGraphFrames that finds the shortest path from one set of vertices toanother. In this section, we will use BFS to traverse our tripGraph toquickly find the desired vertices (that is, airports) and edges (that is,flights). Let's try to find the shortest number of connections betweencities based on the dataset. Note that these examples do not considertime or distance, just hopsof map and reduce.However, even though capable of chewing through petabytes of datadaily, MapReduce is a fairly restricted programming framework. Also,most of the tasks require reading and writing to disk. Seeing thesedrawbacks, in 2009 Matei Zaharia started working on Spark as part ofhis PhD. Spark was first released in 2012. Even though Spark is basedon the same MapReduce concept, its advanced ways of dealing withdata and organizing tasks make it 100x faster than Hadoop (for in-memory computations).In this book, we will guide you through the latest incarnation of ApacheSpark using Python. We will show you how to read structured andunstructured data, how to use some fundamental data types available inPySpark, build machine learning models, operate on graphs, readstreaming data, and deploy your models in the cloud. 
Each chapter willtackle different problem, and by the end of the book we hope you willbe knowledgeable enough to solve other problems we did not havespace to cover here.What this book coversChapter 1, Understanding Spark, provides an introduction into thehttps://www.iteblog.comSpark world with an overview of the technology and the jobsorganization concepts.Chapter 2, Resilient Distributed Datasets, covers RDDs, thefundamental, schema-less data structure available in PySpark.Chapter 3, DataFrames, provides a detailed overview of a data structurethat bridges the gap between Scala and Python in terms of efficiency.Chapter 4, Prepare Data for Modeling, guides the reader through theprocess of cleaning up and transforming data in the Spark environment.Chapter 5, Introducing MLlib, introduces the machine learning librarythat works on RDDs and reviews the most useful machine learningmodels.Chapter 6, Introducing the ML Package, covers the current mainstreammachine learning library and provides an overview of all the modelscurrently available.Chapter 7, GraphFrames, will guide you through the new structure thatmakes solving problems with graphs easy.Chapter 8, TensorFrames, introduces the bridge between Spark and theDeep Learning world of TensorFlow.Chapter 9, Polyglot Persistence with Blaze, describes how Blaze can bepaired with Spark for even easier abstraction of data from varioussources.Chapter 10, Structured Streaming, provides an overview of streamingtools available in PySpark.Chapter 11, Packaging Spark Applications, will guide you through thesteps of modularizing your code and submitting it for execution to Sparkthrough command-line interface.For more information, we have provided two bonus chapters as follows:https://www.iteblog.comInstalling Spark:https://www.packtpub.com/sites/default/files/downloads/InstallingSpark.pdfFree Spark Cloud Offering:https://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdfhttps://www.iteblog.comhttps://www.packtpub.com/sites/default/files/downloads/InstallingSpark.pdfhttps://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdfWhat you need for this bookFor this book you need a personal computer (can be either Windowsmachine, Mac, or Linux). To run Apache Spark, you will need Java 7+and an installed and configured Python 2.6+ or 3.4+ environment; weuse the Anaconda distribution of Python in version 3.5, which can bedownloaded from https://www.continuum.io/downloads.The Python modules we randomly use throughout the book comepreinstalled with Anaconda. We also use GraphFrames andTensorFrames that can be loaded dynamically while starting a Sparkinstance: to load these you just need an Internet connection. It is fine ifsome of those modules are not currently installed on your machine – wewill guide you through the installation process.https://www.iteblog.comhttps://www.continuum.io/downloadsWho this book is forThis book is for everyone who wants to learn the fastest-growingtechnology in big data: Apache Spark. We hope that even the moreadvanced practitioners from the field of data science can find some ofthe examples refreshing and the more advanced topics interesting.https://www.iteblog.comConventionsIn this book, you will find a number of styles of text that distinguishbetween different kinds of information. 
Here are some examples ofthese styles, and an explanation of their meaning.Code words in text, database table names, folder names, filenames, fileextensions, pathnames, dummy URLs, user input, and Twitter handlesare shown as follows:A block of code is set as follows:data = sc.parallelize( [('Amber', 22), ('Alfred', 23), ('Skye',4), ('Albert', 12), ('Amber', 9)])When we wish to draw your attention to a particular part of a codeblock, the relevant lines or items are set in bold:rdd1 = sc.parallelize([('a', 1), ('b', 4), ('c',10)])rdd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])rdd3 = rdd1.leftOuterJoin(rdd2)Any command-line input or output is written as follows:java -versionNew terms and important words are shown in bold. Words that yousee on the screen, in menus or dialog boxes for example, appear in thetext like this: "Clicking the Next button moves you to the next screen."NoteWarnings or important notes appear in a box like this.TipTips and tricks appear like this.https://www.iteblog.comReader feedbackFeedback from our readers is always welcome. Let us know what youthink about this book—what you liked or may have disliked. Readerfeedback is important for us to develop titles that you really get the mostout of.To send us general feedback, simply send an e-mail to, and mention the book title via the subject ofyour message.If there is a topic that you have expertise in and you are interested ineither writing or contributing to a book, see our author guide onwww.packtpub.com/authors.https://www.iteblog.commailto:feedback@packtpub.comhttp://www.packtpub.com/authorsCustomer supportNow that you are the proud owner of a Packt book, we have a numberof things to help you to get the most from your purchase.Downloading the example codeYou can download the example code files for all Packt books you havepurchased from your account at http://www.packtpub.com. If youpurchased this book elsewhere, you can visithttp://www.packtpub.com/support and register to have the files e-maileddirectly to you.All the code is also available on GitHub:https://github.com/drabastomek/learningPySpark.You can download the code files by following these steps:1. Log in or register to our website using your e-mail address andpassword.2. Hover the mouse pointer on the SUPPORT tab at the top.3. Click on Code Downloads & Errata.4. Enter the name of the book in the Search box.5. Select the book for which you're looking to download the code files.6. Choose from the drop-down menu where you purchased this bookfrom.7. Click on Code Download.Once the file is downloaded, please make sure that you unzip or extractthe folder using the latest version of:WinRAR / 7-Zip for WindowsZipeg / iZip / UnRarX for Mac7-Zip / PeaZip for LinuxThe code bundle for the book is also hosted on GitHub athttps://github.com/PacktPublishing/Learning-PySpark. We also havehttps://www.iteblog.comhttp://www.packtpub.comhttp://www.packtpub.com/supporthttps://github.com/drabastomek/learningPySparkhttps://github.com/PacktPublishing/Learning-PySparkother code bundles from our rich catalog of books and videos availableat https://github.com/PacktPublishing/. Check them out!Downloading the color images of this bookWe also provide you with a PDF file that has color images of thescreenshots/diagrams used in this book. The color images will help youbetter understand the changes in the output. 
You can download this filefromhttps://www.packtpub.com/sites/default/files/downloads/LearningPySpark_ColorImages.pdfErrataAlthough we have taken every care to ensure the accuracy of ourcontent, mistakes do happen. If you find abetween cities. For example, to find thenumber of direct flights between Seattle and San Francisco, you can runthe following query:# Obtain list of direct flights between SEA and SFOfilteredPaths = tripGraph.bfs( fromExpr = "id = 'SEA'", toExpr = "id = 'SFO'", maxPathLength = 1)# display list of direct flightsdisplay(filteredPaths)fromExpr and toExpr are the expressions indicating the origin anddestination airports (that is, SEA and SFO, respectively). ThemaxPathLength = 1 indicates that we only want one edge between thetwo vertices, that is, a non-stop flight between Seattle and SanFrancisco. As noted in the following results, there are many direct flightsbetween Seattle and San Francisco:But how about if we want to determine the number of direct flightshttps://www.iteblog.combetween San Francisco and Buffalo? Running the following query willnote that there are no results, that is, no direct flights between the twocities:# Obtain list of direct flights between SFO and BUFfilteredPaths = tripGraph.bfs( fromExpr = "id = 'SFO'", toExpr = "id = 'BUF'", maxPathLength = 1)# display list of direct flightsdisplay(filteredPaths)Once we modify the preceding query to maxPathLength = 2, that is, onelayover, then you will see a lot more flight options:# display list of one-stop flights between SFO and BUFfilteredPaths = tripGraph.bfs( fromExpr = "id = 'SFO'", toExpr = "id = 'BUF'", maxPathLength = 2)# display list of flightsdisplay(filteredPaths)The following table provides an abridged version of the output from thisquery:From Layover ToSFO MSP (Minneapolis) BUFSFO EWR (Newark) BUFSFO JFK (New York) BUFSFO ORD (Chicago) BUFSFO ATL (Atlanta) BUFSFO LAS (Las Vegas) BUFhttps://www.iteblog.comSFO BOS (Boston) BUFBut now that I have my list of airports, how can I determine whichlayover airports are more popular between SFO and BUF? To determinethis, you can now run the following query:# Display most popular layover cities by descending countdisplay(filteredPaths.groupBy("v1.id", "v1.City").count().orderBy(desc("count")).limit(10))The output is shown in the following bar chart:https://www.iteblog.comVisualizing flights using D3To get a powerful and fun visualization of the flight paths andconnections in this dataset, we can leverage the Airports D3visualization (https://mbostock.github.io/d3/talk/20111116/airports.html)within our Databricks notebook. By connecting our GraphFrames,DataFrames, and D3 visualizations, we can visualize the scope of all theflight connections as noted for all on-time or early departing flightswithin this dataset.The blue circles represent the vertices (that is, airports) where the size ofthe circle represents the number of edges (that is, flights) in and out ofthose airports. The black lines are the edges themselves (that is, flights)and their respective connections to the other vertices (that is, airports).Note for any edges that go offscreen, they are representing vertices (thatis, airports) in the states of Hawaii and Alaska.For this to work, we first create a scala package called d3a that isembedded in our notebook (you can download it from here:http://bit.ly/2kPkXkc). 
Because we're using Databricks notebooks, wecan make Scala calls within our PySpark notebook:%scala// On-time and Early Arrivalsimport d3a._graphs.force( height = 800, width = 1200, clicks = sql("""select src, dst as dest, count(1) as count from departureDelays_geo where delayouter; as notedin the following diagram:Each layer is comprised of one or more nodes with connections (that is,flow of data) between each of these nodes, as noted in the precedingdiagram. Input nodes are passive in that they receive data, but do notmodify the information. The nodes in the hidden and output layers willactively modify the data. For example, the connections from the threenodes in the input layer to one of the nodes in the first hidden layer isnoted in the following diagram:https://www.iteblog.comReferring to a signal processing neural network example, each input(denoted as ) has a weight applied to it ( ), which produces anew value. In this case, one of the hidden nodes ( ) is the result ofthree modified input nodes:There is also a bias applied to the sum in a form of a constant that alsogets adjusted during the training process. The sum (the h 1 in ourexample) passes through so-called activation function that determineshttps://www.iteblog.comthe output of the neuron. Some examples of such activation functionsare presented in the following image:This process is repeated for each node in the hidden layers as well as theoutput layer. The output node is the accumulation of all the weightsapplied to the input values for every active layer node. The learningprocess is the result of many iterations running in parallel, applying andreapplying these weights (in this scenario).Neural networks appear in all the different sizes and shapes. The mostpopular are single- and multi-layer feedforward networks that resemblethe one presented earlier; such structures (even with two layers and oneneuron!) neuron in the output layer are capable of solving simpleregression problems (such as linear and logistic) to highly complexregression and classification tasks (with many hidden layers and ahttps://www.iteblog.comnumber of neurons). Another type commonly used are self-organizingmaps, sometimes referred to as Kohonen networks, due to TeuvoKohonen, a Finnish researcher who first proposed such structures. Thestructures are trained without-a-teacher, that is, they do not require atarget (an unsupervised learning paradigm). Such structures are usedmost commonly to solve clustering problems where the aim is to find anunderlying pattern in the data.NoteFor more information about neural network types, we suggest checkingthis document: http://www.ieee.cz/knihovna/Zhang/Zhang100-ch03.pdfNote that there are many other interesting deep learning libraries inaddition to TensorFlow; including, but not limited, to Theano, Torch,Caffe, Microsoft Cognitive Toolkit (CNTK), mxnet, and DL4J.The need for neural networks and DeepLearningThere are many potential applications with neural networks (and DeepLearning). Some of the more popular ones include facial recognition,handwritten digit identification, game playing, speech recognition,language translation, and object classification. 
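Before moving on, it may help to make the single-node computation described earlier concrete: a hidden node takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The following is a minimal NumPy sketch of that idea; the input values, weights, bias, and the choice of a sigmoid activation are illustrative assumptions rather than values from the book:

import numpy as np

# Illustrative inputs and parameters (assumed values)
x = np.array([0.5, -1.0, 2.0])    # the three input nodes
w = np.array([0.1, 0.4, -0.2])    # one weight per incoming connection
b = 0.05                          # bias, also adjusted during training

# Weighted sum of the inputs plus the bias
z = np.dot(w, x) + b

# The activation function (here, a sigmoid) determines the node's output
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

h1 = sigmoid(z)
print(h1)   # the value this hidden node passes on to the next layer

Training a network amounts to repeatedly adjusting the weights and biases of every such node so that the output layer's predictions get closer to the targets.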
The key aspect of all of these applications is that they involve learning and pattern recognition.

While neural networks have been around for a long time (at least within the context of the history of computer science), they have become more popular now because of two overarching themes: the advances in, and availability of, distributed computing, and the advances in research:

Advances and availability of distributed computing and hardware: Distributed computing frameworks such as Apache Spark allow you to complete more training iterations faster by being able to run more models in parallel to determine the optimal parameters for your machine learning models. With the prevalence of GPUs – graphics processing units that were originally designed for displaying graphics – these processors are adept at performing the resource-intensive mathematical computations required for machine learning. Together with cloud computing, it becomes easier to harness the power of distributed computing and GPUs due to the lower up-front costs, minimal time to deployment, and easier-to-deploy elastic scale.

Advances in deep learning research: These hardware advances have helped return neural networks to the forefront of data science with projects such as TensorFlow as well as other popular ones such as Theano, Caffe, Torch, Microsoft Cognitive Toolkit (CNTK), mxnet, and DL4J.

To dive deeper into these topics, two good references include:

Lessons Learned from Deploying Deep Learning at Scale (http://blog.algorithmia.com/deploying-deep-learning-cloud-services/): This blog post by the folks at Algorithmia discusses their learnings on deploying deep learning solutions at scale.

Neural Networks by Christos Stergio and Dimitrios Siganos (http://bit.ly/2hNSWar): A great primer on neural networks.

As noted previously, Deep Learning is part of a family of machine learning methods based on learning representations of data. In the case of learning representations, this can also be defined as feature learning. What makes Deep Learning so exciting is that it has the potential to replace or minimize the need for manual feature engineering. Deep Learning will allow the machine to not just learn a specific task, but also learn the features needed for that task. More succinctly, it automates feature engineering, or teaches machines to learn how to learn (a great reference on feature learning is Stanford's Unsupervised Feature Learning and Deep Learning tutorial: http://deeplearning.stanford.edu/tutorial/).

Breaking these concepts down to the fundamentals, let's start with a feature. As observed in Christopher Bishop's Pattern Recognition and Machine Learning (Berlin: Springer. ISBN 0-387-31073-8. 2006) and as noted in the previous chapters on MLlib and ML, a feature is a measurable property of the phenomenon being observed.

If you are more familiar with the domain of statistics, a feature would be in reference to the independent variables (x1, x2, …, xn) within a stochastic linear regression model:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

In this specific example, y is the dependent variable, the xi are the independent variables, and ε is the random error term.

Within the context of machine learning scenarios, examples of features include:

Restaurant recommendations: Features include the reviews, ratings, and other content and user profile attributes related to the restaurant. 
A good example of this model is the Yelp FoodRecommendation System:http://cs229.stanford.edu/proj2013/SawantPai-YelpFoodRecommendationSystem.pdf).Handwritten Digit recognition: Features include block wisehistograms (count of pixels along 2D directions), holes, strokedetection, and so on. Examples include:Handwritten Digit Classification:http://ttic.uchicago.edu/~smaji/projects/digits/Recognizing Handwritten Digits and Characters:http://cs231n.stanford.edu/reports/vishnu_final.pdfImage Processing: Features include the points, edges, and objectswithin the image; some good examples include:Seminar: Feature extraction by André Aichert,http://home.in.tum.de/~aichert/featurepres.pdfUniversity of Washington Computer Science & EngineeringCSE455: Computer Vision Lecture 6,https://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdfhttps://www.iteblog.comhttp://cs229.stanford.edu/proj2013/SawantPai-YelpFoodRecommendationSystem.pdfhttp://ttic.uchicago.edu/~smaji/projects/digits/http://cs231n.stanford.edu/reports/vishnu_final.pdfhttp://home.in.tum.de/~aichert/featurepres.pdfhttps://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdfFeature engineering is about determining which of these features (forexample, in statistics, the independent variables) are important indefining the model that you are creating. Typically, it involves theprocess of using domain knowledge to create the features to allow theML models to work.Coming up with features is difficult, time-consuming,requiresexpert knowledge."Applied machine learning" is basically feature engineering.—Andrew Ng, Machine Learning and AI via Brain simulations(http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10595.pdfWhat is feature engineering?Typically, performing feature engineering involves concepts such asfeature selection (selecting a subset of the original feature set) or featureextraction (building a new set of features from the original feature set):In feature selection, based on domain knowledge, you can filter thevariables that you think define the model (for example, predictingfootball scores based on number of turnovers). Often data analysistechniques such as regression and classification can also be used tohelp you determine this.In feature extraction, the idea is to transform the data from a highdimensional space (that is, many different independent variables) toa smaller space of fewer dimensions. Continuing the footballanalogy, this would be the quarterback rating, which is based onseveral selected features (e.g. completions, touchdowns,interceptions, average gain per pass attempt, etc.). A commonapproach for feature extraction within the linear data transformationspace is principal component analysis (PCA):http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca. Other commonmechanisms include:Nonlinear dimensionality reduction:https://www.iteblog.comhttp://helper.ipam.ucla.edu/publications/gss2012/gss2012_10595.pdfhttp://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pcahttps://en.wikipedia.org/wiki/Nonlinear_dimensionality_reductionMultilinear subspace learning:https://en.wikipedia.org/wiki/Multilinear_subspace_learningTipA good reference on the topic of feature selection versus featureextraction is What is dimensionality reduction? 
What is thedifference between feature selection and extraction?http://datascience.stackexchange.com/questions/130/what-is-dimensionality-reduction-what-is-the-difference-between-feature-selecti/132#132Bridging the data and algorithmLet's bridge the feature and feature engineering definitions within thecontext of feature selection using the example of restaurantrecommendations:While this is a simplified model, the analogy describes the basic premiseof applied machine learning. It would be up to a data scientist to analyzethe data to determine the key features of this restauranthttps://www.iteblog.comhttps://en.wikipedia.org/wiki/Nonlinear_dimensionality_reductionhttps://en.wikipedia.org/wiki/Multilinear_subspace_learninghttp://datascience.stackexchange.com/questions/130/what-is-dimensionality-reduction-what-is-the-difference-between-feature-selecti/132#132recommendation model.In our restaurant recommendation case, while it may be easy to presumethat geolocation and cuisine type are major factors, it will require somedigging into the data to understand how the user (that is, restaurant-goer) has chosen their preference for a restaurant. Different restaurantsoften have different characteristics or weights for the mode.For example, the key features for high-end restaurant cateringbusinesses are often related to location (that is, proximity to theircustomer's location), the ability to make reservations for large parties,and the diversity of the wine list:Meanwhile, for specialty restaurants, often few of those previous factorsare involved; instead, the focus is on the reviews, ratings, social mediabuzz, and, possibly whether the restaurant is good for kids:https://www.iteblog.comThe ability to segment these different restaurants (and their targetaudience) is a key facet of applied machine learning. It can be anarduous process where you try different models and algorithms withdifferent variables and weights and then retry after iteratively trainingand testing many different combinations. But note how this timeconsuming iterative approach itself is its own process that canpotentially be automated? This is the key context of building algorithmsof helping machines learn to learn: Deep Learning has the potential toautomating the learning process when building our models.https://www.iteblog.comWhat is TensorFlow?TensorFlow is a Google open source software library for numericalcomputation using data flow graphs; that is, an open source machinelearning library focusing on Deep Learning. Based loosely on neuralnetworks, TensorFlow is the culmination of the work of Google's BrainTeam researchers and engineers to apply Deep Learning to Googleproducts and build production models for various Google teamsincluding (but not limited to) search, photos, and speech.Built on C++ with a Python interface, it has quickly become one ofthe most popular Deep Learning projects in a short amount of time. Thefollowing screenshot denotes a Google Trends comparison between fourpopular deep learning libraries; note the spike around November 8th -14th, 2015 (when TensorFlow was announced) and the rapid rise overthe last year (this snapshot was taken late December 2016):Another way to measure the popularity of TensorFlow is to note thatTensorFlow is the most popular machine learning framework on GitHubhttps://www.iteblog.comper http://www.theverge.com/2016/4/13/11420144/google-machine-learning-tensorflow-upgrade. 
Note that TensorFlow was only released inNovember 2015 and in two months it had already become the mostpopular forked ML GitHub repository. In the following diagram, you canreview the GitHub Repositories Created in 2015 (InteractiveVisualization) per http://donnemartin.com/viz/pages/2015:As noted previously, TensorFlow performs numerical computation usingdata flow graphs. When thinking about graph (as per the previouschapter on GraphFrames), the node (or vertices) of this graph representmathematical operations while the graph edges represent themultidimensional arrays (that is, tensors) that communicate between thedifferent nodes (that is, mathematical operations).Referring to the following diagram, t 1 is a 2x3 matrix while t 2 is a 3x2https://www.iteblog.comhttp://www.theverge.com/2016/4/13/11420144/google-machine-learning-tensorflow-upgradehttp://donnemartin.com/viz/pages/2015matrix; these are the tensors (or edges of the tensor graph). The node isthe mathematical operations represented as op 1:In this example, op 1 is a matrix multiplication operation represented bythe following diagram, though this could be any of the manymathematics operations available in TensorFlow:Together, to perform your numerical computations within the graph,there is a flow of multidimensional arrays (that is, tensors) betweenthe mathematical operations (nodes) - that is, the flow of tensors, orTensorFlow.To better understand how TensorFlow works, let's start by installingTensorFlow within your Python environment (initially sans Spark). Forthe full instructions, please refer to TensorFlow | Download and Setup:https://www.iteblog.comhttps://www.tensorflow.org/versions/r0.12/get_started/os_setup.html.For this chapter, let's focus on the Python pip package managementsystem installation on Linux or Mac OS.Installing PipEnsure that you have installed pip; if have not, please use the followingcommands to install the Python package installation manager forUbuntu/Linux:# Ubuntu/Linux 64-bit $ sudo apt-get install python-pip python-devFor Mac OS, you would use the following commands:# macOS $ sudo easy_install pip $ sudo easy_install --upgrade sixNote, for Ubuntu/Linux, you may also want to upgrade pip as the pipwithin the Ubuntu repository is old and may not be compatible withnewer packages. To do this, you can run the command:# Ubuntu/Linux pip upgrade$ pip install --upgrade pip Installing TensorFlowTo install TensorFlow (with pip already installed), you only need toexecute the following command:$ pip install tensorflowIfyou have a computer that has GPU support, you can instead use thefollowing command:$ pip install tensorflow-gpuNote that if the preceding command does not work, there are specificinstructions to install TensorFlow with GPU support based on yourhttps://www.iteblog.comhttps://www.tensorflow.org/versions/r0.12/get_started/os_setup.htmlPython version (that is, 2.7, 3.4, or 3.5) and GPU support.For example, if I wanted to install TensorFlow on Python 2.7 with GPUenabled on Mac OS, execute the following commands:# macOS, GPU enabled, Python 2.7: $ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/gpu/tensorflow_gpu-0.12.0rc1-py2-none-any.whl# Python 2 $ sudo pip install --upgrade $TF_BINARY_URLNotePlease refer tohttps://www.tensorflow.org/versions/r0.12/get_started/os_setup.html forthe latest installation instructions.Matrix multiplication using constantsTo better describe tensors and how TensorFlow works, let's start with amatrix multiplication calculation involving two constants. 
As noted in the following diagram, we have c1 (a 1x3 matrix) and c2 (a 3x1 matrix), where the operation (op1) is a matrix multiplication:

We will now define c1 (1x3 matrix) and c2 (3x1 matrix) using the following code:

# Import TensorFlow
import tensorflow as tf

# Setup the matrices
#   c1: 1x3 matrix
#   c2: 3x1 matrix
c1 = tf.constant([[3., 2., 1.]])
c2 = tf.constant([[-1.], [2.], [1.]])

Now that we have our constants, let's run our matrix multiplication using the following code. Within the context of a TensorFlow graph, recall that the nodes in the graph are called operations (or ops). The following matrix multiplication is the op, while the two matrices (c1, c2) are the tensors (typed multi-dimensional arrays). An op takes zero or more tensors as its input, performs the operation such as a mathematical calculation, with the output being zero or more tensors in the form of numpy ndarray objects (http://www.numpy.org/) or tensorflow::Tensor interfaces in C, C++:

# mp: matrix multiplication (c1 x c2)
mp = tf.matmul(c1, c2)

Now that this TensorFlow graph has been established, execution of this operation (in this case, the matrix multiplication) is done within the context of a session; the session places the graph ops onto the CPU or GPU (that is, devices) to be executed:

# Launch the default graph
s = tf.Session()

# run: Execute the ops in the graph
r = s.run(mp)
print(r)

With the output being:

# [[ 2.]]

Once you have completed your operations, you can close the session:

# Close the Session when completed
s.close()

Matrix multiplication using placeholders

Now we will perform the same task as before, except this time, we will use tensors instead of constants. As noted in the following diagram, we will start off with two matrices (m1: 1x3, m2: 3x1) using the same values as in the previous section:

Within TensorFlow, we will use placeholder to define our two tensors as per the following code snippet:

# Setup placeholders for your model
#   t1: placeholder tensor
#   t2: placeholder tensor
t1 = tf.placeholder(tf.float32)
t2 = tf.placeholder(tf.float32)

# tp: matrix multiplication (t1 x t2)
tp = tf.matmul(t1, t2)

The advantage of this approach is that, with placeholders, you can use the same operation (in this case, the matrix multiplication) with tensors of different sizes and shapes (provided they meet the criteria of the operation). Like the operations in the previous section, let's define two matrices and execute the graph (with a simplified session execution).

Running the model

The following code snippet is similar to the code snippet in the previous section, except that it now uses placeholders instead of constants:

# Define input matrices
m1 = [[3., 2., 1.]]
m2 = [[-1.], [2.], [1.]]

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1:m1, t2:m2}))

With the output being both the value, as well as the data type:

[array([[ 2.]], dtype=float32)]

Running another model

Now that we have a graph (albeit a simple one) using placeholders, we can use different tensors to perform the same operation using different input matrices. As noted in the following figure, we have m1 (a 1x4 matrix) and m2 (a 4x1 matrix):

Because we're using placeholders, we can easily reuse the same graph within a new session using new input:

# Setup input matrices
m1 = [[3., 2., 1., 0.]]
m2 = [[-5.], [-4.], [-3.], [-2.]]

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1:m1, t2:m2}))

With the output being:

[array([[-26.]], dtype=float32)]

Discussion

As noted previously, TensorFlow provides users with the ability to perform deep learning using Python libraries by representing computations as graphs, where the tensors represent the data (the edges of the graph) and the operations represent what is to be executed, for example, mathematical computations (the vertices of the graph).

For more information, please refer to:

TensorFlow | Get Started | Basic Usage: https://www.tensorflow.org/get_started/get_started#basic_usage
Shannon McCormick's Neural Network and Google TensorFlow: http://www.slideshare.net/ShannonMcCormick4/neural-networks-and-google-tensor-flow

Introducing TensorFrames

At the time of writing, TensorFrames is an experimental binding for Apache Spark; it was introduced in early 2016, shortly after the release of TensorFlow. With TensorFrames, one can manipulate Spark DataFrames with TensorFlow programs. Referring to the tensor diagrams in the previous section, we have updated the figure to show how Spark DataFrames work with TensorFlow, as shown in the following diagram:

As noted in the preceding diagram, TensorFrames provides a bridge between Spark DataFrames and TensorFlow. This allows you to take your DataFrames and apply them as input into your TensorFlow computation graph. TensorFrames also allows you to take the TensorFlow computation graph output and push it back into DataFrames so you can continue your downstream Spark processing.

In terms of common usage scenarios for TensorFrames, these typically include the following:

Utilize TensorFlow with your data

The integration of TensorFlow and Apache Spark with TensorFrames allows data scientists to expand their analytics, streaming, graph, and machine learning capabilities to include Deep Learning via TensorFlow. This allows you to both train and deploy models at scale.

Parallel training to determine optimal hyperparameters

When building deep learning models, there are several configuration parameters (that is, hyperparameters) that impact how the model is trained. 
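To give a flavour of what such a parallel search can look like in Spark terms, here is a minimal, hypothetical sketch; the parameter grid, the train_and_evaluate helper, and its placeholder error value are assumptions for illustration and do not come from the book or from the blog post referenced in the next paragraph:

import itertools

# Candidate hyperparameter combinations (illustrative values)
param_grid = [{'learning_rate': lr, 'hidden_neurons': n}
              for lr, n in itertools.product([0.001, 0.01, 0.1],
                                             [32, 64, 128])]

def train_and_evaluate(params):
    # Hypothetical helper: on a worker, this would build and train a small
    # TensorFlow graph with the given parameters and return the error
    # measured on a held-out set; the model code itself is omitted here.
    error = 1.0   # placeholder result
    return (params, error)

# Each Spark task trains one candidate model; the driver keeps the best one
results = sc.parallelize(param_grid, len(param_grid)) \
            .map(train_and_evaluate) \
            .collect()
best_params, best_error = min(results, key=lambda kv: kv[1])

Because each combination is independent, Spark simply schedules one training run per task and the driver collects the (parameters, error) pairs to pick the winner.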
Common in Deep Learning/artificial neural networks arehyperparameters that define the learning rate (if the rate is high it willlearn quickly, but it may not take into account highly variable input -that is, it will not learn well if the rate and variability in the data is toohigh) and the number of neurons in each layer of your neural network(too many neurons results in noisy estimates, while too few neurons willresult in the network not learning well).As observed in Deep Learning with Apache Spark and TensorFlow(https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html), using Spark with TensorFlow to help findthe best set of hyperparameters for neural network training resulted inan order of magnitude reduction in training time and a 34% lower errorrate for the handwritten digit recognition dataset.For more information on Deep Learning and hyperparameters,pleaserefer to:Optimizing Deep Learning Hyper-Parameters Through anEvolutionary Algorithmhttp://ornlcda.github.io/MLHPC2015/presentations/4-Steven.pdfhttps://www.iteblog.comhttps://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.htmlhttp://ornlcda.github.io/MLHPC2015/presentations/4-Steven.pdfCS231n Convolutional Network Networks for Visual Recognitionhttp://cs231n.github.io/Deep Learning with Apache Spark and TensorFlowhttps://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.htmlAt the time of writing, TensorFrames is officially supported as ofApache Spark 1.6 (Scala 2.10), though most contributions are currentlyfocused on Spark 2.0 (Scala 2.11). The easiest way to use TensorFramesis to access it via Spark Packages (https://spark-packages.org).https://www.iteblog.comhttp://cs231n.github.io/https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.htmlhttps://spark-packages.orgTensorFrames – quick startAfter all this preamble, let's jump start our use of TensorFrames with thisquick start tutorial. You can download and use the full notebook withinDatabricks Community Edition at http://bit.ly/2hwGyuC.You can also run this from the PySpark shell (or other Sparkenvironments), like any other Spark package:# The version we're using in this notebook$SPARK_HOME/bin/pyspark --packages tjhunter:tensorframes:0.2.2-s_2.10 # Or use the latest version $SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.3-s_2.10Note, you will only use one of the above commands (that is, not both).For more information, please refer to the databricks/tensorframesGitHub repository (https://github.com/databricks/tensorframes).Configuration and setupPlease follow the configuration and setup steps in the following order:Launching a Spark clusterLaunch a Spark cluster using Spark 1.6 (Hadoop 1) and Scala 2.10. Thishas been tested with Spark 1.6, Spark 1.6.2, and Spark 1.6.3 (Hadoop 1)on Databricks Community Edition (http://databricks.com/try-databricks).Creating a TensorFrames libraryCreate a library to attach TensorFrames 0.2.2 to your cluster:tensorframes-0.2.2-s_2.10. 
Refer to Chapter 7, GraphFrames torecall how to create a library.Installing TensorFlow on your clusterhttps://www.iteblog.comhttp://bit.ly/2hwGyuChttps://github.com/databricks/tensorframeshttp://databricks.com/try-databricksIn a notebook, run one of the following commands to install TensorFlow.This has been tested with TensorFlow 0.9 CPU edition:TensorFlow 0.9, Ubuntu/Linux 64-bit, CPU only, Python 2.7:/databricks/python/bin/pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whlTensorFlow 0.9, Ubuntu/Linux 64-bit, GPU enabled, Python 2.7:/databricks/python/bin/pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whlThe following is the pip install command that will install TensorFlow onto the Apache Spark driver:%sh/databricks/python/bin/pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whlA successful installation should have something similar to the followingoutput:Collecting tensorflow==0.9.0rc0 from https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl Downloading https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl (27.6MB) Requirement already satisfied (use --upgrade to upgrade): numpy>=1.8.2 in /databricks/python/lib/python2.7/site-packages (from tensorflow==0.9.0rc0) Requirement already satisfied (use --upgrade to upgrade): six>=1.10.0 in /usr/lib/python2.7/dist-packages (from tensorflow==0.9.0rc0) Collecting protobuf==3.0.0b2 (from tensorflow==0.9.0rc0) Downloading protobuf-3.0.0b2-py2.py3-none-any.whl (326kB) Requirement already satisfied (use --upgrade to upgrade): wheel in /databricks/python/lib/python2.7/site-packages (from tensorflow==0.9.0rc0) Requirement already satisfied (use --upgrade to upgrade): setuptools in /databricks/python/lib/python2.7/site-packages (from protobuf==3.0.0b2->tensorflow==0.9.0rc0) Installing collected packages: protobuf, tensorflow Successfully installed protobuf-https://www.iteblog.com3.0.0b2 tensorflow-0.9.0rc0Upon successful installation of TensorFlow, detach and reattach thenotebook where you just ran this command. Your cluster is nowconfigured; you can run pure TensorFlow programs on the driver, orTensorFrames examples on the whole cluster.Using TensorFlow to add a constant to anexisting columnThis is a simple TensorFrames program where the op is to perform asimple addition. Note that the original source code can be found in thedatabricks/tensorframes GitHub repository. This is in reference to theTensorFrames Readme.md | How to Run in Python section(https://github.com/databricks/tensorframes#how-to-run-in-python).The first thing we will do is import TensorFlow, TensorFrames, andpyspark.sql.row to create a DataFrame based on an RDD of floats:# Import TensorFlow, TensorFrames, and Rowimport tensorflow as tfimport tensorframes as tfsfrom pyspark.sql import Row# Create RDD of floats and convert into DataFrame `df`rdd = [Row(x=float(x)) for x in range(10)]df = sqlContext.createDataFrame(rdd)To view the df DataFrame generated by the RDD of floats, we can usethe show command:df.show()This produces the following result:https://www.iteblog.comhttps://github.com/databricks/tensorframes#how-to-run-in-pythonExecuting the Tensor graphAs noted previously, this tensor graph consists of adding 3 to the tensorcreated by the df DataFrame generated by the RDD of floats. 
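The screenshot of that result is not reproduced here; given the RDD of floats defined above (the values 0.0 through 9.0), the output of df.show() should look roughly as follows:

+---+
|  x|
+---+
|0.0|
|1.0|
|2.0|
|3.0|
|4.0|
|5.0|
|6.0|
|7.0|
|8.0|
|9.0|
+---+
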
We willnow execute the following code snippet:# Run TensorFlow program executes:# The 'op' performs the addition (i.e. 'x' + '3')# Place the data back into a DataFramewith tf.Graph().as_default() as g:# The placeholder that corresponds to column 'x'.# The shape of the placeholder is automatically# inferred from the DataFrame. x = tfs.block(df, "x") # The output that adds 3 to x z = tf.add(x, 3, name='z') # The resulting `df2` DataFrame df2 = tfs.map_blocks(z, df)# Note that 'z' is the tensor output from the# 'tf.add' operationprint z## Outputhttps://www.iteblog.comTensor("z:0", shape=(?,), dtype=float64)Here are some specific call outs for the preceding code snippet:x utilizes tfs.block, where the block builds a block placeholderbased on the content of a column in a DataFramez is the output tensor from the TensorFlow add method (tf.add)df2 is the new DataFrame, which adds an extra column to the dfDataFrame with the z tensor block by blockWhile z is the tensor (as noted in the preceding output), for us to workwith the results of the TensorFlow program, we will utilize the df2dataframe. The output from df2.show() is as follows:Blockwise reducing operations exampleIn this next section, we will show how to work with blockwise reducingoperations. Specifically, we will compute the sum and min of fieldvectors, working with blocks of rows for more efficient processing.Building a DataFrame of vectorsFirst, we will create a one-column DataFrame of vectors:# Build a DataFrame of vectorshttps://www.iteblog.comdata = [Row(y=[float(y), float(-y)]) for y in range(10)]df = sqlContext.createDataFrame(data)df.show()The output is as follows:Analysing the DataFrameWe need to analyze the DataFrame to determine what its shape is (thatis, dimensions of the vectors). For example, in the following snippet, weuse the tfs.print_schema command for the df DataFrame:# Print the information gathered by TensorFlow to check the content of the DataFrametfs.print_schema(df)## Outputroot |-- y: array (nullable = true) double[?,?]Notice the double[?,?], meaning that TensorFlow does not knowthedimensions of the vectors:# Because the dataframe contains vectors, we need to analyze it# first to find the dimensions of the vectors.df2 = tfs.analyze(df)# The information gathered by TF can be printed https://www.iteblog.com# to check the content:tfs.print_schema(df2)## Outputroot |-- y: array (nullable = true) double[?,2] Upon analysis of the df2 DataFrame, TensorFlow has inferred that ycontains vectors of size 2. For small tensors (scalars and vectors),TensorFrames usually infers the shapes of the tensors without requiring apreliminary analysis. If it cannot do so, an error message will indicatethat you need to run the DataFrame through tfs.analyze() first.Computing elementwise sum and min of all vectorsNow, let's analyze the df DataFrame to compute the sum and theelementwise min of all the vectors using tf.reduce_sum andtf.reduce_min:tf.reduce_sum: Computes the sum of elements across dimensions ofa tensor, for example, if x = [[3, 2, 1], [-1, 2, 1]] thentf.reduce_sum(x) ==> 8. More information can be found at:https://www.tensorflow.org/api_docs/python/tf/reduce_sum.tf.reduce_min: Computes the minimum of elements acrossdimensions of a tensor, for example, if x = [[3, 2, 1], [-1, 2,1]] then tf.reduce_min(x) ==> -1. 
More information can be foundat: https://www.tensorflow.org/api_docs/python/tf/reduce_min.The following code snippet allows us to perform efficient elementwisereductions using TensorFlow, where the source data is within aDataFrame:# Note: First, let's make a copy of the 'y' column. # This is an inexpensive operation in Spark 2.0+df3 = df2.select(df2.y, df2.y.alias("z"))# Execute the Tensor Graphwith tf.Graph().as_default() as g: # The placeholders. # Note the special name that end with '_input':https://www.iteblog.com y_input = tfs.block(df3, 'y', tf_name="y_input") z_input = tfs.block(df3, 'z', tf_name="z_input") # Perform elementwise sum and minimum y = tf.reduce_sum(y_input, [0], name='y') z = tf.reduce_min(z_input, [0], name='z') # The resulting dataframe(data_sum, data_min) = tfs.reduce_blocks([y, z], df3)# The finalresults are numpy arrays:print "Elementwise sum: %s and minimum: %s " % (data_sum, data_min)## OutputElementwise sum: [ 45. -45.] and minimum: [ 0. -9.] With a few lines of TensorFlow code with TensorFrames, we can takethe data stored within the df DataFrame and execute a Tensor Graph toperform element wise sum and min, merge the data back into aDataFrame, and (in our case) print out the final values.https://www.iteblog.comSummaryIn this chapter, we have reviewed the fundamentals of neural networksand Deep Learning, including the components of feature engineering.With all this new excitement in Deep Learning, we introducedTensorFlow and how it can work closely together with Apache Sparkthrough TensorFrames.TensorFrames is a powerful deep learning tool that allows data scientistsand engineers to work with TensorFlow with data stored in SparkDataFrames. This allows you to expand the capabilities of Apache Sparkto a powerful deep learning toolset that is based on the learning processof neural networks. To help continue your Deep Learning journey, thefollowing are some great TensorFlow and TensorFrames resources:TensorFlow: https://www.tensorflow.org/TensorFlow | Get Started:https://www.tensorflow.org/get_started/get_startedTensorFlow | Guides: https://www.tensorflow.org/tutorials/Deep Learning on Databricks:https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.htmlTensorFrames (GitHub): https://github.com/databricks/tensorframesTensorFrames User Guide:https://github.com/databricks/tensorframes/wiki/TensorFrames-user-guidehttps://www.iteblog.comhttps://www.tensorflow.org/https://www.tensorflow.org/get_started/get_startedhttps://www.tensorflow.org/tutorials/https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.htmlhttps://github.com/databricks/tensorframeshttps://github.com/databricks/tensorframes/wiki/TensorFrames-user-guideChapter 9. Polyglot Persistencewith BlazeOur world is complex and no single approach exists that solves allproblems. Likewise, in the data world one cannot solve all problemswith one piece of technology.Nowadays, any big technology company uses (in one form or another) aMapReduce paradigm to sift through terabytes (or even petabytes) ofdata collected daily. On the other hand, it is much easier to store,retrieve, extend, and update information about products in a document-type database (such as MongoDB) than it is in a relational database. Yet,persisting transaction records in a relational database aids later datasummarizing and reporting.Even these simple examples show that solving a vast array of businessproblems requires adapting to different technologies. 
This means thatyou, as a database manager, data scientist, or data engineer, would haveto learn all of these separately if you were to solve your problems withthe tools that are designed to solve them easily. This, however, does notmake your company agile and is prone to errors and lots of tweaking andhacking needing to be done to your system.Blaze abstracts most of the technologies and exposes a simple andelegant data structure and API.In this chapter, you will learn:How to install BlazeWhat polyglot persistence is aboutHow to abstract data stored in files, pandas DataFrames, or NumPyarraysHow to work with archives (GZip)How to connect to SQL (PostgreSQL and SQLite) and No-SQL(MongoDB) databases with Blazehttps://www.iteblog.comHow to query, join, sort, and transform the data, and perform simplesummary statisticsInstalling BlazeIf you run Anaconda it is easy to install Blaze. Just issue the followingcommand in your CLI (see the Bonus Chapter 1, Installing Spark if youdo not know what a CLI is):conda install blazeOnce the command is issued, you will see a screen similar to thefollowing screenshot:We will later use Blaze to connect to the PostgreSQL and MongoDBdatabases, so we need to install some additional packages that Blaze willuse in the background.We will install SQL Alchemy and PyMongo, both of which are part ofAnaconda:conda install sqlalchemyconda install pymongoAll that is now left to do is to import Blaze itself in our notebook:import blaze as blhttps://www.iteblog.comPolyglot persistenceNeal Ford introduced the, somewhat similar, term polyglot programmingin 2006. He used it to illustrate the fact that there is no such thing as aone-size-fits-all solution and advocated using multiple programminglanguages that were more suitable for certain problems.In the parallel world of data, any business that wants to remaincompetitive needs to adapt a range of technologies that allows it to solvethe problems in a minimal time, thus minimizing the costs.Storing transactional data in Hadoop files is possible, but makes littlesense. On the other hand, processing petabytes of Internet logs using aRelational Database Management System (RDBMS) would also beill-advised. These tools were designed to tackle specific types of tasks;even though they can be co-opted to solve other problems, the cost ofadapting the tools to do so would be enormous. It is a virtual equivalentof trying to fit a square peg in a round hole.For example, consider a company that sells musical instruments andaccessories online (and in a network of shops). At a high-level, there area number of problems that a company needs to solve to be successful:1. Attract customers to its stores (both virtual and physical).2. Present them with relevant products (you would not try to sell adrum kit to a pianist, would you?!).3. Once they decide to buy, process the payment and organizeshipping.To solve these problems a company might choose from a number ofavailable technologies that were designed to solve these problems:1. Store all the products in a document-based database suchasMongoDB, Cassandra, DynamoDB, or DocumentDB. There aremultiple advantages of document databases: flexible schema,sharding (breaking bigger databases into a set of smaller, moremanageable ones), high availability, and replication, among others.https://www.iteblog.com2. 
Model the recommendations using a graph-based database (such asNeo4j, Tinkerpop/Gremlin, or GraphFrames for Spark): suchdatabases reflect the factual and abstract relationships betweencustomers and their preferences. Mining such a graph is invaluableand can produce a more tailored offering for a customer.3. For searching, a company might use a search-tailored solution suchas Apache Solr or ElasticSearch. Such a solution provides fast,indexed text searching capabilities.4. Once a product is sold, the transaction normally has a well-structured schema (such as product name, price, and so on.) To storesuch data (and later process and report on it) relational databasesare best suited.With polyglot persistence, a company always chooses the right tool forthe right job instead of trying to coerce a single technology into solvingall of its problems.Blaze, as we will see, abstracts these technologies and introduces asimple API to work with, so you do not have to learn the APIs of eachand every technology you want to use. It is, in essence, a great workingexample of polyglot persistence.NoteTo see how others do it, check outhttp://www.slideshare.net/Couchbase/couchbase-at-ebay-2014orhttp://www.slideshare.net/bijoor1/case-study-polyglotpersistence-in-pharmaceutical-industry.https://www.iteblog.comhttp://www.slideshare.net/Couchbase/couchbase-at-ebay-2014https://www.slideshare.net/bijoor1/case-study-polyglot-persistence-in-pharmaceutical-industryAbstracting dataBlaze can abstract many different data structures and expose a single,easy-to-use API. This helps to get a consistent behavior and reduce theneed to learn multiple interfaces to handle data. If you know pandas,there is not really that much to learn, as the differences in the syntax aresubtle. We will go through some examples to illustrate this.Working with NumPy arraysGetting data from a NumPy array into the DataShape object of Blaze isextremely easy. First, let's create a simple NumPy array: we first loadNumPy and then create a matrix with two rows and three columns:import numpy as npsimpleArray = np.array([ [1,2,3], [4,5,6] ])Now that we have an array, we can abstract it with Blaze's DataShapestructure:simpleData_np = bl.Data(simpleArray)That's it! Simple enough.In order to peek inside the structure you can use the .peek() method:simpleData_np.peek()You should see an output similar to what is shown in the followingscreenshot:https://www.iteblog.comYou can also use (familiar to those of you versed in pandas' syntax) the.head(...) method.NoteThe difference between .peek() and .head(...) is that .head(...)allows the specification of the number of rows as its only parameter,whereas .peek() does not allow that and will always print the top 10records.If you want to retrieve the first column of your DataShape, you can useindexing:simpleData_np[0]You should see a table, as shown here:On the other hand, if you were interested in retrieving a row, all youwould have to do (like in NumPy) is transpose your DataShape:simpleData_np.T[0]What you will then get is presented in the following figure:https://www.iteblog.comNotice that the name of the column is None. DataShapes, just likepandas' DataFrames, support named columns. 
Thus, let's specify thenames of our fields:simpleData_np = bl.Data(simpleArray, fields=['a', 'b', 'c'])Now you can retrieve the data simply by calling the column by its name:simpleData_np['b']In return, you will get the following output:As you can see, defining the fields transposes the NumPy array and,now, each element of the array forms a row, unlike when we firstcreated the simpleData_np.Working with pandas' DataFramehttps://www.iteblog.comSince pandas' DataFrame internally uses NumPy data structures,translating a DataFrame to DataShape is effortless.First, let's create a simple DataFrame. We start by importing pandas:import pandas as pdNext, we create a DataFrame:simpleDf = pd.DataFrame([ [1,2,3], [4,5,6] ], columns=['a','b','c'])We then transform it into a DataShape:simpleData_df = bl.Data(simpleDf)You can retrieve data in the same manner as with the DataShape createdfrom the NumPy array. Use the following command:simpleData_df['a']Then, it will produce the following output:Working with filesA DataShape object can be created directly from a .csv file. In thisexample, we will use a dataset that consists of 404,536 traffic violationsthat happened in the Montgomery county of Maryland.https://www.iteblog.comNoteWe downloaded the data from https://catalog.data.gov/dataset/traffic-violations-56dda on 8/23/16; the dataset is updated daily, so the numberof traffic violations might differ if you retrieve the dataset at a later date.We store the dataset in the ../Data folder locally. However, wemodified the dataset slightly so we could store it in the MongoDB: in itsoriginal form, with date columns, reading data back from MongoDBcaused errors. We filed a bug with Blaze to fix this issuehttps://github.com/blaze/blaze/issues/1580:import odotraffic = bl.Data('../Data/TrafficViolations.csv')If you do not know the names of the columns in any dataset, you can getthese from the DataShape. To get a list of all the fields, you can use thefollowing command:print(traffic.fields)TipThose of you familiar with pandas would easily recognize the similaritybetween the .fields and .columns attributes, as these work inessentially the same way - they both return the list of columns (in thecase of pandas DataFrame), or the list of fields, as columns are called inthe case of Blaze DataShape.https://www.iteblog.comhttps://catalog.data.gov/dataset/traffic-violations-56ddahttps://github.com/blaze/blaze/issues/1580Blaze can also read directly from a GZipped archive, saving space:traffic_gz = bl.Data('../Data/TrafficViolations.csv.gz')To validate that we get exactly the same data, let's retrieve the first tworecords from each structure. You can either call the following:traffic.head(2)Or you can choose to call:traffic_gz.head(2)It produces the same results (columns abbreviated here):It is easy to notice, however, that it takes significantly more time toretrieve the data from the archived file because Blaze needs todecompress the data.You can also read from multiple files at one time and create one bigdataset. To illustrate this, we have split the original dataset into fourGZipped datasets by year of violation (these are stored in the../Data/Years folder).Blaze uses odo to handle saving DataShapes to a variety of formats. 
Tohttps://www.iteblog.comsave traffic data for traffic violations by year you can call odo like this:import odofor year in traffic.Stop_year.distinct().sort(): odo.odo(traffic[traffic.Stop_year == year], '../Data/Years/TrafficViolations_{0}.csv.gz'\ .format(year))The preceding instruction saves the data into a GZip archive, but you cansave it to any of the formats mentioned earlier. The first argument to the.odo(...) method specifies the input object (in our case, the DataShapewith traffic violations that occurred in 2013), the second argument is theoutput object - the path to the file we want to save the data to. As weare about to learn - storing data is not limited to files only.To read from multiple files you can use the asterisk character *:traffic_multiple = bl.Data( '../Data/Years/TrafficViolations_*.csv.gz')traffic_multiple.head(2)The preceding snippet, once again, will produce a familiar table:Blaze reading capabilities are not limited to .csv or GZip files only: youcan readdata from JSON or Excel files (both, .xls and .xlsx), HDFS,or bcolz formatted files.https://www.iteblog.comTipTo learn more about the bcolz format, check its documentation athttps://github.com/Blosc/bcolz.Working with databasesBlaze can also easily read from SQL databases such as PostgreSQL orSQLite. While SQLite would normally be a local database, thePostgreSQL can be run either locally or on a server.Blaze, as mentioned earlier, uses odo in the background to handle thecommunication to and from the databases.Noteodo is one of the requirements for Blaze and it gets installed along withthe package. Check it out here https://github.com/blaze/odo.In order to execute the code in this section, you will need two things: arunning local instance of a PostgreSQL database, and a locally runningMongoDB database.TipIn order to install PostgreSQL, download the package fromhttp://www.postgresql.org/download/ and follow the installationinstructions for your operating system found there.To install MongoDB, go to https://www.mongodb.org/downloads anddownload the package; the installation instructions can be found herehttp://docs.mongodb.org/manual/installation/.Before you proceed, we assume that you have a PostgreSQL databaseup and running at http://localhost:5432/, and MongoDB databaserunning at http://localhost:27017.We have already loaded the traffic data to both of the databases andhttps://www.iteblog.comhttps://github.com/Blosc/bcolzhttps://github.com/blaze/odohttp://www.postgresql.org/download/https://www.mongodb.org/downloadshttp://docs.mongodb.org/manual/installation/stored them in the traffic table (PostgreSQL) or the traffic collection(MongoDB).TipIf you do not know how to upload your data, I have explained this in myother book https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook.Interacting with relational databasesLet's read the data from the PostgreSQL database now. The UniformResource Identifier (URI) for accessing a PostgreSQL database has thefollowing syntax postgresql://:@:/::.To read the data from PostgreSQL, you just wrap the URI around.Data(...) - Blaze will take care of the rest:traffic_psql = bl.Data( 'postgresql://{0}:{1}@localhost:5432/drabast::traffic'\ .format('', ''))We use Python's .format(...) method to fill in the string with theappropriate data.TipSubstitute your credentials to access your PostgreSQL database in theprevious example. 
If you want to read more about the .format(...)method, you can check out the Python 3.5 documentationhttps://docs.python.org/3/library/string.html#format-string-syntax.It is quite easy to output the data to either the PostgreSQL or SQLitedatabases. In the following example, we will output traffic violationsthat involved cars manufactured in 2016 to both PostgreSQL and SQLitedatabases. As previously noted, we will use odo to manage the transfers:traffic_2016 = traffic_psql[traffic_psql['Year'] == 2016]https://www.iteblog.comhttps://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbookhttps://docs.python.org/3/library/string.html#format-string-syntax# Drop commands# odo.drop('sqlite:///traffic_local.sqlite::traffic2016')# odo.drop('postgresql://{0}:{1}@localhost:5432/drabast::traffic'\.format('', ''))# Save to SQLiteodo.odo(traffic_2016,'sqlite:///traffic_local.sqlite::traffic2016')# Save to PostgreSQLodo.odo(traffic_2016, 'postgresql://{0}:{1}@localhost:5432/drabast::traffic'\ .format('', ''))In a similar fashion to pandas, to filter the data, we effectively select theYear column (the traffic_psql['Year'] part of the first line) and createa Boolean flag by checking whether each and every record in thatcolumn equals 2016. By indexing the traffic_psql object with such atruth vector, we extract only the records where the corresponding valueequals True.The two commented out lines should be uncommented if you alreadyhave the traffic2016 tables in your databases; otherwise odo willappend the data to the end of the table.The URI for SQLite is slightly different than the one for PostgreSQL; ithas the following syntax sqlite://::.Reading data from the SQLite database should be trivial for you by now:traffic_sqlt = bl.Data( 'sqlite:///traffic_local.sqlite::traffic2016')Interacting with the MongoDB databaseMongoDB has gained lots of popularity over the years. It is a simple,fast, and flexible document-based database. The database is a go-tostorage solution for all full-stack developers, using the MEAN.js stack: Mhere stands for Mongo (see http://meanjs.org).https://www.iteblog.comhttp://meanjs.orgSince Blaze is meant to work in a very familiar way no matter what yourdata source, reading from MongoDB is very similar to reading fromPostgreSQL or SQLite databases:traffic_mongo = bl.Data( 'mongodb://localhost:27017/packt::traffic')https://www.iteblog.comData operationsWe have already presented some of the most common methods you willuse with DataShapes (for example, .peek()), and ways to filter the databased on the column value. Blaze has implemented many methods thatmake working with any data extremely easy.In this section, we will review a host of other commonly used ways ofworking with data and methods associated with them. For those of youcoming from pandas and/or SQL, we will provide a respective syntaxwhere equivalents exist.Accessing columnsThere are two ways of accessing columns: you can get a single columnat a time by accessing them as if they were a DataShape attribute:traffic.Year.head(2)The preceding script produces the following output:You can also use indexing that allows the selection of more than onecolumn at a time:(traffic[['Location', 'Year', 'Accident', 'Fatal', 'Alcohol']] .head(2))This generates the following output:https://www.iteblog.comThe preceding syntax would be the same for pandas DataFrames. Forthose of you unfamiliar with Python and pandas API, please note threethings here:1. 
To specify multiple columns, you need to enclose them in anotherlist: note the double brackets [[ and ]].2. If the chain of all methods does not fit on one line (or you want tobreak the chain for better readability) you have two choices: eitherenclose the whole chain of methods in brackets (...) where the ...is the chain of all methods, or, before breaking into the new line, putthe backslash character \ at the end of every line in the chain. Weprefer the latter and will use that in our examples from now on.3. Note that the equivalent SQL code would be:SELECT *FROM trafficLIMIT 2Symbolic transformationsThe beauty of Blaze comes from the fact that it can operatesymbolically. What this means is that you can specify transformations,filters, or other operations on your data and store them as object(s). Youcan then feed such object with almost any form of data conforming tothe original schema, and Blaze will return the transformed data.For example, let's select all the traffic violations that occurred in 2013,and return only the 'Arrest_Type', 'Color', and 'Charge` columns.First, if we could not reflect the schema from an already existing object,we would have to specify the schema manually. To do this, we will usehttps://www.iteblog.comthe .symbol(...) method to achieve that; the first argument to themethod specifies a symbolic name of the transformation (we preferkeeping it the same as the name of the object, but it can be anything),and the second argument is a long string that specifies the schema in a: fashion, separated by commas:schema_example = bl.symbol('schema_exampl', '{id: int, name: string}')Now, you coulduse the schema_example object and specify sometransformations. However, since we already have an existing trafficdataset, we can reuse the schema by using traffic.dshape andspecifying our transformations:traffic_s = bl.symbol('traffic', traffic.dshape)traffic_2013 = traffic_s[traffic_s['Stop_year'] == 2013][ ['Stop_year', 'Arrest_Type','Color', 'Charge']]To present how this works, let's read the original dataset into pandas'DataFrame:traffic_pd = pd.read_csv('../Data/TrafficViolations.csv')Once read, we pass the dataset directly to the traffic_2013 object andperform the computation using the .compute(...) method of Blaze; thefirst argument to the method specifies the transformation object (ours istraffic_2013) and the second parameter is the data that thetransformations are to be performed on:bl.compute(traffic_2013, traffic_pd).head(2)Here is the output of the preceding snippet:https://www.iteblog.comYou can also pass a list of lists or a list of NumPy arrays. Here, we usethe .values attribute of the DataFrame to access the underlying list ofNumPy arrays that form the DataFrame:bl.compute(traffic_2013, traffic_pd.values)[0:2]This code will produce precisely what we would expect:Operations on columnsBlaze allows for easy mathematical operations to be done on numericcolumns. All the traffic violations cited in the dataset occurred between2013 and 2016. You can check that by getting all the distinct values forthe Stop_year column using the .distinct() method. The .sort()method sorts the results in an ascending order:traffic['Stop_year'].distinct().sort()The preceding code produces the following output table:https://www.iteblog.comAn equivalent syntax for pandas would be as follows:traffic['Stop_year'].unique().sort()For SQL, use the following code:SELECT DISTINCT Stop_yearFROM trafficYou can also make some mathematical transformations/arithmetic to thecolumns. 
Since all the traffic violations occurred after year 2000, we cansubtract 2000 from the Stop_year column without losing any accuracy:traffic['Stop_year'].head(2) - 2000Here is what you should get in return:https://www.iteblog.comThe same could be attained from pandas DataFrame with an identicalsyntax (assuming traffic was of pandas DataFrame type). For SQL, theequivalent would be:SELECT Stop_year - 2000 AS Stop_yearFROM trafficHowever, if you want to do some more complex mathematicaloperations (for example, log or pow) then you first need to use the oneprovided by Blaze (that, in the background, will translate your commandto a suitable method from NumPy, math, or pandas).For example, if you wanted to log-transform the Stop_year you need touse this code:bl.log(traffic['Stop_year']).head(2)This will produce the following output:Reducing dataSome reduction methods are also available, such as .mean() (thatcalculates the average), .std (that calculates standard deviation), or.max() (that returns the maximum from the list). Executing the followingcode:traffic['Stop_year'].max()https://www.iteblog.comIt will return the following output:If you had a pandas DataFrame you can use the same syntax, whereasfor SQL the same could be done with the following code:SELECT MAX(Stop_year) AS Stop_year_maxFROM trafficIt is also quite easy to add more columns to your dataset. Say, youwanted to calculate the age of the car (in years) at the time when theviolation occurred. First, you would take the Stop_year and subtract theYear of manufacture.In the following code snippet, the first argument to the .transform(...)method is the DataShape, the transformation is to be performed on, andthe other(s) would be a list of transformations.traffic = bl.transform(traffic, Age_of_car = traffic.Stop_year - traffic.Year)traffic.head(2)NoteIn the source code of the .transform(...) method such lists would beexpressed as *args as you could specify more than one column to becreated in one go. The *args argument to any method would take anynumber of subsequent arguments and treat it as if it was a list.The above code produces the following table:https://www.iteblog.comAn equivalent operation in pandas could be attained through thefollowing code:traffic['Age_of_car'] = traffic.apply( lambda row: row.Stop_year - row.Year, axis = 1)For SQL you can use the following code:SELECT * , Stop_year - Year AS Age_of_carFROM trafficIf you wanted to calculate the average age of the car involved in a fataltraffic violation and count the number of occurrences, you can performa group by operation using the .by(...) operation:bl.by(traffic['Fatal'], Fatal_AvgAge=traffic.Age_of_car.mean(), Fatal_Count =traffic.Age_of_car.count())The first argument to .by(...) specifies the column of the DataShape toperform the aggregation by, followed by a series of aggregations wewant to get. In this example, we select the Age_of_car column andcalculate an average and count the number of rows per each value of the'Fatal' column.https://www.iteblog.comThe preceding script produces the following aggregation:For pandas, an equivalent would be as follows:traffic\ .groupby('Fatal')['Age_of_car']\ .agg({ 'Fatal_AvgAge': np.mean, 'Fatal_Count': np.count_nonzero })For SQL, it would be as follows:SELECT Fatal , AVG(Age_of_car) AS Fatal_AvgAge , COUNT(Age_of_car) AS Fatal_CountFROM trafficGROUP BY FatalJoinsJoining two DataShapes is straightforward as well. 
To present how this isdone, although the same result could be attained differently, we firstselect all the traffic violations by violation type (the violation object)and the traffic violations involving belts (the belts object):violation = traffic[ ['Stop_month','Stop_day','Stop_year', 'Stop_hr','Stop_min','Stop_sec','Violation_Type']]belts = traffic[ ['Stop_month','Stop_day','Stop_year', 'Stop_hr','Stop_min','Stop_sec','Belts']]https://www.iteblog.comNow, we join the two objects on the six date and time columns.NoteThe same effect could have been attained if we just simply selected thetwo columns: Violation_type and Belts in one go. However, thisexample is to show the mechanics of the .join(...) method, so bearwith us.The first argument to the .join(...) method is the first DataShape wewant to join with, the second argument is the second DataShape, whilethe third argument can be either a single column or a list of columns toperform the join on:violation_belts = bl.join(violation, belts, ['Stop_month','Stop_day','Stop_year', 'Stop_hr','Stop_min','Stop_sec'])Once we have the full dataset in place, let's check how many trafficviolations involved belts and what sort of punishment was issued to thedriver:bl.by(violation_belts[['Violation_Type', 'Belts']], Violation_count=violation_belts.Belts.count()).sort('Violation_count', ascending=False)Here's the output of the preceding script:https://www.iteblog.comThe same could be achieved in pandas with the following code:violation.merge(belts, on=['Stop_month','Stop_day','Stop_year', 'Stop_hr','Stop_min','Stop_sec']) \ .groupby(['Violation_type','Belts']) \ .agg({ 'Violation_count': np.count_nonzero }) \ .sort('Violation_count', ascending=False)With SQL, you would use the following snippet:SELECT innerQuery.*FROM ( SELECT a.Violation_type , b.Belts , COUNT() AS Violation_count FROM violation AS a INNER JOIN belts AS b ON a.Stop_month = b.Stop_monthhttps://www.iteblog.com AND a.Stop_day = b.Stop_day AND a.Stop_year = b.Stop_year AND a.Stop_hr = b.Stop_hr AND a.Stop_min = b.Stop_min AND a.Stop_sec = b.Stop_sec GROUP BY Violation_type , Belts) AS innerQueryORDER BY Violation_count DESChttps://www.iteblog.comSummaryThe concepts presentedin this chapter are just the beginning of the roadto using Blaze. There are many other ways it can be used and datasources it can connect with. Treat this as a starting point to build yourunderstanding of polyglot persistence.Note, however, that these days most of the concepts explained in thischapter can be attained natively within Spark, as you can useSQLAlchemy directly within Spark making it easy to work with avariety of data sources. The advantage of doing so, despite the initialinvestment of learning the API of SQLAlchemy, is that the data returnedwill be stored in a Spark DataFrame and you will have access toeverything that PySpark has to offer. This, by no means, implies that younever should never use Blaze: the choice, as always, is yours.In the next chapter, you will learn about streaming and how to do it withSpark. Streaming has become an increasingly important topic these days,as, daily (true as of 2016), the world produces roughly 2.5 exabytes ofdata (source: http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day/) that need to be ingested, processedand made sense of.https://www.iteblog.comhttp://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day/Chapter 10. 
StructuredStreamingThis chapter will provide a jump-start on the concepts behind SparkStreaming and how this has evolved into Structured Streaming. Animportant aspect of Structured Streaming is that it utilizes SparkDataFrames. This shift in paradigm will make it easier for Pythondevelopers to start working with Spark Streaming.In this chapter, your will learn:What is Spark Streaming?Why do we need Spark Streaming?What is the Spark Streaming application data flow?Simple streaming application using DStreamA quick primer on Spark Streaming global aggregationsIntroducing Structured StreamingNote, for the initial sections of this chapter, the example code used willbe in Scala, as this was how most Spark Streaming code was written.When we start focusing on Structured Streaming, we will work withPython examples.What is Spark Streaming?At its core, Spark Streaming is a scalable, fault-tolerant streamingsystem that takes the RDD batch paradigm (that is, processing data inbatches) and speeds it up. While it is a slight over-simplification,basically Spark Streaming operates in mini-batches or batch intervals(from 500ms to larger interval windows).As noted in the following diagram, Spark Streaming receives an inputdata stream and internally breaks that data stream into multiple smallerbatches (the size of which is based on the batch interval). The Sparkengine processes those batches of input data to a result set of batches ofhttps://www.iteblog.comprocessed data.Source: Apache Spark Streaming Programming Guide at:http://spark.apache.org/docs/latest/streaming-programming-guide.htmlThe key abstraction for Spark Streaming is Discretized Stream(DStream), which represents the previously mentioned small batchesthat make up the stream of data. DStreams are built on RDDs, allowingSpark developers to work within the same context of RDDs and batches,only now applying it to their streaming problems. Also, an importantaspect is that, because you are using Apache Spark, Spark Streamingintegrates with MLlib, SQL, DataFrames, and GraphX.The following figure denotes the basic components of Spark Streaming:Source: Apache Spark Streaming Programming Guide at:http://spark.apache.org/docs/latest/streaming-programming-guide.htmlhttps://www.iteblog.comhttp://spark.apache.org/docs/latest/streaming-programming-guide.htmlhttp://spark.apache.org/docs/latest/streaming-programming-guide.htmlSpark Streaming is a high-level API that provides fault-tolerant exactly-once semantics for stateful operations. Spark Streaming has built inreceivers that can take on many sources, with the most common beingApache Kafka, Flume, HDFS/S3, Kinesis, and Twitter. 
For example, themost commonly used integration between Kafka and Spark Streaming iswell documented in the Spark Streaming + Kafka Integration Guidefound at: https://spark.apache.org/docs/latest/streaming-kafka-integration.html.Also, you can create your own custom receiver, such as the MeetupReceiver (https://github.com/actions/meetup-stream/blob/master/src/main/scala/receiver/MeetupReceiver.scala),which allows you to read the Meetup Streaming API(https://www.meetup.com/meetup_api/docs/stream/2/rsvps/) using SparkStreaming.NoteWatch the Meetup Receiver in ActionIf you are interested in seeking the Spark Streaming Meetup Receiver inaction, you can refer to the Databricks notebooks at:https://github.com/dennyglee/databricks/tree/master/notebooks/Users/denny%40databricks.com/content/Streaming%20Meetup%20RSVPswhich utilize the previously mentioned Meetup Receiver.The following is a screenshot of the notebook in action left window,while viewing the Spark UI (Streaming Tab) on the right.https://www.iteblog.comhttps://spark.apache.org/docs/latest/streaming-kafka-integration.htmlhttps://github.com/actions/meetup-stream/blob/master/src/main/scala/receiver/MeetupReceiver.scalahttps://www.meetup.com/meetup_api/docs/stream/2/rsvps/)https://github.com/dennyglee/databricks/tree/master/notebooks/Users/denny%40databricks.com/content/Streaming%20Meetup%20RSVPshttps://www.iteblog.comYou will be able to use Spark Streaming to receive Meetup RSVPs fromaround the country (or world) and get a near real-time summary ofMeetup RSVPs by state (or country). Note, these notebooks arecurrently written in Scala.https://www.iteblog.comWhy do we need SparkStreaming?As noted by Tathagata Das – committer and member of the projectmanagement committee (PMC) to the Apache Spark project and leaddeveloper of Spark Streaming – in the Datanami article SparkStreaming: What is It and Who's Using it(https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/), there is a business need for streaming. With theprevalence of online transactions and social media, as well as sensorsand devices, companies are generating and processing more data at afaster rate.The ability to develop actionable insight at scale and in real timeprovides those businesses with a competitive advantage. Whether youare detecting fraudulent transactions, providing real-time detection ofsensor anomalies, or reacting to the next viral tweet, streaming analyticsis becoming increasingly important in data scientists' and data engineer'stoolbox.The reason Spark Streaming is itself being rapidly adopted is becauseApache Spark unifies all of these disparate data processing paradigms(Machine Learning via ML and MLlib, Spark SQL, and Streaming)within the same framework. So, you can go from training machinelearning models (ML or MLlib), to scoring data with these models(Streaming) and perform analysis using your favourite BI tool (SQL) –all within the same framework. 
Companies including Uber, Netflix, and Pinterest often showcase their Spark Streaming use cases:

How Uber Uses Spark and Hadoop to Optimize Customer Experience: https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/
Spark and Spark Streaming at Netflix: https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/
Can Spark Streaming survive Chaos Monkey? http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Real-time analytics at Pinterest: https://engineering.pinterest.com/blog/real-time-analytics-pinterest

Currently, there are four broad use cases surrounding Spark Streaming:

Streaming ETL: Data is continuously being cleansed and aggregated prior to being pushed downstream. This is commonly done to reduce the amount of data to be stored in the final datastore.
Triggers: Real-time detection of behavioral or anomaly events triggers immediate and downstream actions. For example, a device that is within the proximity of a detector or beacon will trigger an alert.
Data enrichment: Real-time data is joined to other datasets, allowing for richer analysis. For example, including real-time weather information with flight information to build better travel alerts.
Complex sessions and continuous learning: Multiple sets of events associated with real-time streams are continuously analyzed and/or used to update machine learning models. For example, the stream of user activity associated with an online game that allows us to better segment the user.

What is the Spark Streaming application data flow?

The following figure provides the data flow between the Spark driver, workers, streaming sources, and targets:

It all starts with the Spark Streaming Context, represented by ssc.start() in the preceding figure:

1. When the Spark Streaming Context starts, the driver will execute a long-running task on the executors (that is, the Spark workers).
2. 
The Receiver on the executors (Executor 1 in this diagram)receives a data stream from the Streaming Sources. With theincoming data stream, the receiver divides the stream into blocksand keeps these blocks in memory.3. These blocks are also replicated to another executor to avoid dataloss.4. The block ID information is transmitted to the Block Managementhttps://www.iteblog.comMaster on the driver.5. For every batch interval configured within Spark Streaming Context(commonly this is every 1 second), the driver will launch Sparktasks to process the blocks. Those blocks are then persisted to anynumber of target data stores, including cloud storage (for example,S3, WASB, and so on), relational data stores (for example, MySQL,PostgreSQL, and so on), and NoSQL stores.Suffice it to say, there are a lot of moving parts for a streamingapplication that need to be continually optimized and configured. Mostof the documentation for Spark Streaming is more complete in Scala, so,as you are working with the Python APIs, you may sometimes need toreference the Scala version of the documentation instead. If this happensto you, please file a bug and/or fill out a PR if you have a proposed fix(https://issues.apache.org/jira/browse/spark/).For a deeper dive on this topic, please refer to:1. Spark 1.6 Streaming Programming Guide:https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html2. Tathagata Das' Deep Dive with Spark Streaming (Spark Meetup2013-06-17): http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617https://www.iteblog.comhttps://issues.apache.org/jira/browse/spark/https://spark.apache.org/docs/1.6.0/streaming-programming-guide.htmlhttp://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617Simple streaming applicationusing DStreamsLet's create a simple word count example using Spark Streaming inPython. For this example, we will be working with DStream – theDiscretized Stream of small batches that make up the stream of data.The example used for this section of the book can be found in itsentirety at:https://github.com/drabastomek/learningPySpark/blob/master/Chapter10/streaming_word_count.pyThis word count example will use the Linux / Unix nc command – it is asimple utility that reads and writes data across network connections. Wewill use two different bash terminals, one using the nc command to sendwords to our computer's local port (9999) and one terminal that will runSpark Streaming to receive those words and count them. The initial setof commands for our script are noted here:1. # Create a local SparkContext and Streaming Contexts2. from pyspark import SparkContext3. from pyspark.streaming import StreamingContext4. 5. # Create sc with two working threads 6. sc = SparkContext("local[2]", "NetworkWordCount")7. 8. # Create local StreamingContextwith batch interval of 1 second9. ssc = StreamingContext(sc, 1)10. 11. # Create DStream that connects to localhost:999912. lines = ssc.socketTextStream("localhost", 9999)Here are some important call outs for the preceding commands:1. The StreamingContext on line 9 is the entry point into SparkStreaming2. The 1 of ...(sc, 1) on line 9 is the batch interval; in this case, weare running micro-batches every second.3. The lines on line 12 is the DStream representing the data streamhttps://www.iteblog.comhttps://github.com/drabastomek/learningPySpark/blob/master/Chapter10/streaming_word_count.pyextracted via the ssc.socketTextStream.4. 
As noted in the description, the ssc.socketTextStream is the SparkStreaming method to review a text stream for a particular socket; inthis case, your local computer on socket 9999.The next few lines of code (as described in the comments), split the linesDStream into words and then, using RDDs, count each word in eachbatch of data and print this information out to the console (line number9):1. # Split lines into words2. words = lines.flatMap(lambda line: line.split(" "))3. 4. # Count each word in each batch5. pairs = words.map(lambda word: (word, 1))6. wordCounts = pairs.reduceByKey(lambda x, y: x + y)7. 8. # Print the first ten elements of each RDD in this DStream 9. wordCounts.pprint()The final set of lines of the code start Spark Streaming (ssc.start())and then await a termination command to stop running (for example,). If no termination command is sent, then the SparkStreaming program will continue running.# Start the computationssc.start() # Wait for the computation to terminatessc.awaitTermination() Now that you have your script, as noted earlier, open two terminalwindows – one for your nc command, and one for your Spark StreamingProgram. To start the nc command, from one of your terminals, type:nc –lk 9999Everything you type from this point onwards in that terminal will betransmitted to port 9999, as noted in the following screenshot:https://www.iteblog.comIn this example (as noted previously), I typed the words green threetimes and blue five times. From the other terminal screen, let's run thePython streaming script you just created. In this example, I named thescript streaming_word_count.py../bin/spark-submitstreaming_word_count.py localhost 9999.The command will run the streaming_word_count.py script, readingyour local computer (that is, localhost) port 9999 to receive any wordssent to that socket. As you have already sent information to the port onthe first screen, shortly after starting up the script, your Spark Streamingprogram will read the words sent to port 9999 and perform a word countas noted in the following screenshot:The streaming_word_count.py script will continue to read and print anynew information to the console. Going back to our first terminal (withthe nc command), we now can type our next set of words, as noted inthe following screenshot:https://www.iteblog.comReviewing the streaming script in the second terminal, you will noticethat this script continues to run every second (that is, the configuredbatch interval), and you will notice the calculated word count forgohawks a few seconds later:With this relatively simple script, now you can see Spark Streaming inaction with Python. But if you continue typing words into the nchttps://www.iteblog.comterminal, you will notice that this information is not aggregated. Forexample, if we continue to write green in the nc terminal (as noted here):The Spark Streaming terminal will report the current snapshot of data;that is, the two additionalgreen values (as noted here):What did not happen was the concept of global aggregations, where wewould keep state for this information. 
What this means is that, instead ofreporting 2 new greens, we could get Spark Streaming to give us theoverall counts of green, for example, 7 greens, 5 blues, and 1 gohawks.https://www.iteblog.comWe will talk about global aggregations in the form of UpdateStateByKey/ mapWithState in the next section.TipFor other good PySpark Streaming examples, check out:Network Wordcount (in Apache Spark GitHub repo):https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.pyPython Streaming Examples:https://github.com/apache/spark/tree/master/examples/src/main/python/streamingS3 FileStream Wordcount (Databricks notebook):https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#07%20Spark%20Streaming/06%20FileStream%20Word%20Count%20-%20Python.htmlhttps://www.iteblog.comhttps://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.pyhttps://github.com/apache/spark/tree/master/examples/src/main/python/streaminghttps://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#07%20Spark%20Streaming/06%20FileStream%20Word%20Count%20-%20Python.htmlA quick primer on globalaggregationsAs noted in the previous section, so far, our script has performed a pointin time streaming word count. The following diagram denotes the linesDStream and its micro-batches as per how our script had executed inthe previous section:At the 1 second mark, our Python Spark Streaming script returned thevalue of {(blue, 5), (green, 3)}, at the 2 second mark it returned{(gohawks, 1)}, and at the 4 second mark, it returned {(green, 2)}.But what if you had wanted the aggregate word count over a specifictime window?The following figure represents us calculating a stateful aggregation:https://www.iteblog.comIn this case, we have a time window between 0-5 seconds. Note, that inour script we have not got the specified time window: each second, wecalculate the cumulative sum of the words. Therefore, at the 2 secondmark, the output is not just the green and blue from the 1 second mark,but it also includes the gohawks from the 2 second mark: {(blue, 5),(green, 3), (gohawks, 1)}. At the 4 second mark, the additional 2greens provide us a total of {(blue, 5), (green, 5), (gohawks, 1)}.For those of you who regularly work with relational databases, thisseems to be just a GROUP BY, SUM() statement. Yet, in the case ofstreaming analytics, the duration to persist the data long enough to run aGROUP BY, SUM() statement is longer than the batch interval (forexample, 1 second). This means that we would constantly be runningbehind and trying to catch up with the data stream.For example, if you were to run the 1. Streaming and DataFrames.scalaDatabricks notebook athttps://github.com/dennyglee/databricks/blob/master/notebooks/Users/denny%40databricks.com/content/Streaming%20Meetup%20RSVPs/1.%20Streaming%20and%20DataFrames.scalaand you were to view the Streaming jobs in the Spark UI, you would getsomething like the following figure:https://www.iteblog.comhttps://github.com/dennyglee/databricks/blob/master/notebooks/Users/denny%40databricks.com/content/Streaming%20Meetup%20RSVPs/1.%20Streaming%20and%20DataFrames.scalaNotice in the graph that the Scheduling Delay and Total Delaynumbers are rapidly increasing (for example, average Total Delay is 54seconds 254 ms and the actual Total Delay is > 2min) and way outsidethe batch interval threshold of 1 second. 
The reason we see this delay isbecause, inside the streaming code for that notebook, we had also runthe following code:// Populate `meetup_stream` tablesqlContext.sql("insert into meetup_stream select * from meetup_stream_json")That is, inserting any new chunks of data (that is, 1 second RDD micro-batches), converting them into a DataFrame (meetup_stream_jsontable), and inserting the data into a persistent table (meetup_streamtable). Persisting the data in this fashion led to slow streaminghttps://www.iteblog.comperformance with the ever-increasing scheduling delays. To solve thisproblem via streaming analytics, this is where creating globalaggregations via UpdateStateByKey (Spark 1.5 and before) ormapWithState (Spark 1.6 onwards) come in.TipFor more information on Spark Streaming visualizations, please take thetime to review New Visualizations for Understanding Apache SparkStreaming Applications: https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-apache-spark-streaming-applications.html.Knowing this, let's re-write the original streaming_word_count.py sothat we now have a stateful version calledstateful_streaming_word_count.py; you can get the full version of thisscript athttps://github.com/drabastomek/learningPySpark/blob/master/Chapter10/stateful_streaming_word_count.pyThe initial set of commands for our script are noted here: 1. # Create a local SparkContext and Streaming Contexts 2. from pyspark import SparkContext 3. from pyspark.streaming import StreamingContext 4. 5. # Create sc with two working threads 6. sc = SparkContext("local[2]", "StatefulNetworkWordCount") 7. 8. # Create local StreamingContext with batch interval of 1 sec 9. ssc = StreamingContext(sc, 1)10. 11. # Create checkpoint for local StreamingContext12. ssc.checkpoint("checkpoint")13. 14. # Define updateFunc: sum of the (key, value) pairs15. def updateFunc(new_values, last_sum):16. return sum(new_values) + (last_sum or 0)17. 18. # Create DStream that connects to localhost:999919. lines = ssc.socketTextStream("localhost", 9999)https://www.iteblog.comhttps://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-apache-spark-streaming-applications.htmlhttps://github.com/drabastomek/learningPySpark/blob/master/Chapter10/stateful_streaming_word_count.pyIf you recall streaming_word_count.py, the primary differences start atline 11:The ssc.checkpoint("checkpoint") on line 12 configures a SparkStreaming checkpoint. To ensure that Spark Streaming is faulttolerant due to its continual operation, it needs to checkpointenough information to fault-tolerant storage, so it can recover fromfailures. Note, we will not dive deep into this concept (though moreinformation is available in the following Tip section), as many ofthese configurations will be abstracted away with StructuredStreaming.The updateFunc on line 15 tells the program to update theapplication's state (later in the code) via UpdateStateByKey. In thiscase, it is returning a sum of the previous value (last_sum) and thesum of the new values (sum(new_values) + (last_sum or 0)).At line 19, we have the same ssc.socketTextStream as the previousscript.TipFor more information on Spark Streaming checkpoint, some goodreferences are:Spark Streaming Programming Guide > Checkpoint:https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#checkpointingExploring Stateful Streaming with Apache Spark:http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark/The final section of the code is as follows: 1. 
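Before walking through the differences, note that updateFunc (lines 14-16 above) is an ordinary Python function, so you can sanity-check its behavior in isolation. The following few lines are a small illustration only and are not part of the book's script:

def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)

# First batch for a key: there is no previous state, so last_sum is None
print(updateFunc([1, 1, 1], None))   # prints 3

# A later batch for the same key: previous state 3 plus the new value 2
print(updateFunc([2], 3))            # prints 5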
# Calculate running counts 2. running_counts = lines.flatMap(lambda line: line.split(" "))\ 3. .map(lambda word: (word, 1))\ 4. .updateStateByKey(updateFunc) 5. 6. # Print the first ten elements of each RDD generated in https://www.iteblog.comhttps://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#checkpointinghttp://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark/this 7. # stateful DStream to the console 8. running_counts.pprint() 9. 10. # Start the computation11. ssc.start() 12. 13. # Wait for the computation to terminate14. ssc.awaitTermination() While lines 10-14 are identical to the previous script, the difference isthat we now havea running_counts variable that splits to get the wordsand runs a map function to count each word in each batch (in theprevious script this was the words and pairs variables).The primary difference is the use of the updateStateByKey method,which will execute the previously noted updateFunc that performs thesum. updateStateByKey is Spark Streaming's method to performcalculations against your stream of data and update the state for eachkey in a performant manner. It is important to note that you wouldtypically use updateStateByKey for Spark 1.5 and earlier; theperformance of these stateful global aggregations is proportional to thesize of the state. From Spark 1.6 onwards, you should use mapWithState,as the performance is proportional to the size of the batch.TipNote, there is more code typically involved with mapWithState (incomparison to updateStateByKey), hence the examples were writtenusing updateStateByKey.For more information about stateful Spark Streaming, including the useof mapWithState, please refer to:Stateful Network Wordcount Python example:https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.pyGlobal Aggregation using mapWithState (Scala):https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#07%20Spark%20Streaming/12%20Global%20Aggregations%20-https://www.iteblog.comhttps://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.pyhttps://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#07%20Spark%20Streaming/12%20Global%20Aggregations%20-%20mapWithState.html%20mapWithState.htmlWord count using mapWithState (Scala):https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.htmlFaster Stateful Stream Processing in Apache Spark Streaming:https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.htmlhttps://www.iteblog.comhttps://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.htmlhttps://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.htmlIntroducing StructuredStreamingWith Spark 2.0, the Apache Spark community is working on simplifyingstreaming by introducing the concept of structured streaming whichbridges the concepts of streaming with Datasets/DataFrames (as noted inthe following diagram):As noted in earlier chapters on DataFrames, the execution of SQLand/or DataFrame queries within the Spark SQL Engine (and CatalystOptimizer) revolves around building a logical plan, building numerousphysical plans, the engine choosing the correct physical plan based on itscost optimizer, and then generating the code (i.e. code gen) that willdeliver the results in a performant manner. 
What Structured Streamingintroduces is the concept of an Incremental Execution Plan. Whenworking with blocks of data, structured streaming repeatedly applies theexecution plan for every new set of blocks it receives. By running in thismanner, the engine can take advantage of the optimizations includedwithin Spark DataFrames/Datasets and apply them to an incoming datastream. It will also be easier to integrate other DataFrame optimizedcomponents of Spark, including ML Pipelines, GraphFrames,TensorFrames, and many others.https://www.iteblog.comUsing structured streaming will also simplify your code. For example,the following is a pseudo-code example batch aggregation that reads adata stream from S3 and saves it to a MySQL database:logs = spark.read.json('s3://logs')logs.groupBy(logs.UserId).agg(sum(logs.Duration)).write.jdbc('jdbc:mysql//...')The following is a pseudo-code example for a continous aggregation:logs = spark.readStream.json('s3://logs').load()sq = logs.groupBy(logs.UserId).agg(sum(logs.Duration)).writeStream.format('json').start()The reason for creating the sq variable is that it allows you to check thestatus of your structured streaming job and terminate it, as per thefollowing:# Will return true if the `sq` stream is activesq.isActive# Will terminate the `sq` streamsq.stop()Let's take the stateful streaming word count script that had usedupdateStateByKey and make it a structured streaming word count script;you can get the complete structured_streaming_word_count.py scriptat:https://github.com/drabastomek/learningPySpark/blob/master/Chapter10/structured_streaming_word_count.pyAs opposed to the previous scripts, we are now working with the morefamiliar DataFrames code as noted here:# Import the necessary classes and create a local SparkSessionfrom pyspark.sql import SparkSessionfrom pyspark.sql.functions import explodefrom pyspark.sql.functions import splitspark = SparkSession \ .builder \ .appName("StructuredNetworkWordCount") \https://www.iteblog.comhttps://github.com/drabastomek/learningPySpark/blob/master/Chapter10/structured_streaming_word_count.py .getOrCreate()The first lines of the script import the necessary classes and establish thecurrent SparkSession. But, as opposed to the previous streaming scripts,as in the next lines of the script noted here, you do not need to establisha Streaming Context as this is already included within the SparkSession: 1. # Create DataFrame representing the stream of input lines 2. # from connection to localhost:9999 3. lines = spark\ 4. .readStream\ 5. .format('socket')\ 6. .option('host', 'localhost')\ 7. .option('port', 9999)\ 8. .load() 9.10. # Split the lines into words11. words = lines.select(12. explode(13. split(lines.value, ' ')14. ).alias('word')15. )16.17. # Generate running word count18. 
wordCounts = words.groupBy('word').count()Instead, the streaming portion of the code is initiated by callingreadStream in line 4.Lines 3-8 initiate the reading of the data stream from port 9999, justlike the previous two scriptsInstead of running RDD flatMap, map, and reduceByKey functions tosplit the lines read into words and count each word in each batch,we can use the PySpark SQL functions explode and split as notedin lines 10-15Instead of running updateStateByKey or creating an updateFunc asper the stateful streaming word count script, we can generate therunning word count with a familiar DataFrame groupBy statementand count(), as noted in lines 17-18To output this data to the console, we will use writeStream, as notedhere:https://www.iteblog.com 1. # Start running the query that prints the 2. # running counts to the console 3. query = wordCounts\ 4. .writeStream\ 5. .outputMode('complete')\ 6. .format('console')\ 7. .start() 8. 9. # Await Spark Streaming termination10. query.awaitTermination()Instead of using pprint(), we're explicitly calling out writeStream towrite the stream, and defining the format and output mode. While it is alittle longer to write, these methods and properties are syntacticallysimilar with other DataFrame calls and you would only need to changethe outputMode and format properties to save it to a Database, filesystem, console, and so on. Finally, as noted in line 10, we will runawaitTermination to await to cancel this streaming job.Let's go back and run our nc job in the first terminal:$ nc –lk 9999green green green blue blue blue blue bluegohawksgreen greenCheck the following output. As you can see, you get the advantages ofstateful streaming but using the more familiar DataFrame API:https://www.iteblog.comhttps://www.iteblog.comSummaryIt is important to note that Structured Streaming is currently (at the timeof writing) not production-ready. It is, however, a paradigm shift inSpark that will hopefully make it easier for data scientists and dataengineers to build continuous applications. While not explicitly calledout in the previous sections, when working with streaming applications,there are many potential problems thatyou will need to design for, suchas late events, partial outputs, state recovery on failure, distributed readsand writes, and so on. 
With structured streaming, many of these issueswill be abstracted away to make it easier for you to build continuousapplications.We encourage you to try Spark Structured Streaming so you will be ableto easily build streaming applications as structured streaming matures.As Reynold Xin noted in his Spark Summit 2016 East presentation TheFuture of Real-Time in Spark (http://www.slideshare.net/rxin/the-future-of-realtime-in-spark):"The simplest way to perform streaming analytics is not having toreason about streaming."For more information, here are some additional Structured Streamingresources:PySpark 2.1 Documentation: pyspark.sql.module:http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.htmlIntroducing Apache Spark 2.1:https://databricks.com/blog/2016/12/29/introducing-apache-spark-2-1.htmlStructuring Apache Spark 2.0: SQL, DataFrames, Datasets andStreaming - by Michael Armbrust:http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797Structured Streaming Programming Guide:http://spark.apache.org/docs/latest/streaming-programming-https://www.iteblog.comhttp://www.slideshare.net/rxin/the-future-of-realtime-in-sparkhttp://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.htmlhttps://databricks.com/blog/2016/12/29/introducing-apache-spark-2-1.htmlhttp://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797http://spark.apache.org/docs/latest/streaming-programming-guide.htmlguide.htmlStructured Streaming (aka Streaming DataFrames) [SPARK-8360]:https://issues.apache.org/jira/browse/SPARK-8360Structured Streaming Programming Abstraction, Semantics, andAPIs Apache JIRA:https://issues.apache.org/jira/secure/attachment/12793410/StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdfIn the next chapter we will show you how to modularize and package upyour PySpark application and submit it for execution programmatically.https://www.iteblog.comhttps://issues.apache.org/jira/browse/SPARK-8360https://issues.apache.org/jira/secure/attachment/12793410/StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdfChapter 11. Packaging SparkApplicationsSo far we have been working with a very convenient way of developingcode in Spark - the Jupyter notebooks. Such an approach is great whenyou want to develop a proof of concept and document what you doalong the way.However, Jupyter notebooks will not work if you need to schedule a job,so it runs every hour. Also, it is fairly hard to package your applicationas it is not easy to split your script into logical chunks with well-definedAPIs - everything sits in a single notebook.In this chapter, we will learn how to write your scripts in a reusable formof modules and submit jobs to Spark programmatically.Before you begin, however, you might want to check out the BonusChapter 2, Free Spark Cloud Offering where we provide instructionson how to subscribe and use either Databricks' Community Edition orMicrosoft's HDInsight Spark offerings; the instructions on how to do socan be found here:https://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdfIn this chapter you will learn:What the spark-submit command isHow to package and deploy your app programmaticallyHow to modularize your Python code and submit it along withPySpark scriptThe spark-submit commandThe entry point for submitting jobs to Spark (be it locally or on a cluster)is the spark-submit script. 
The script, however, allows you not only tosubmit the jobs (although that is its main purpose), but also kill jobs orhttps://www.iteblog.comhttps://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdfcheck their status.NoteUnder the hood, the spark-submit command passes the call to thespark-class script that, in turn, starts a launcher Java application. Forthose interested, you can check the GitHub repository for Spark:https://github.com/apache/spark/blob/master/bin/sparksubmitt.The spark-submit command provides a unified API for deploying appson a variety of Spark supported cluster managers (such as Mesos orYarn), thus relieving you from configuring your application for each ofthem separately.On the general level, the syntax looks as follows:spark-submit [options] [app arguments]We will go through the list of all the options soon. The app argumentsare the parameters you want to pass to your application.NoteYou can either parse the parameters from the command line yourselfusing sys.argv (after import sys) or you can utilize the argparsemodule for Python.Command line parametersYou can pass a host of different parameters for Spark engine when usingspark-submit.NoteIn what follows we will cover only the parameters specific for Python(as spark-submit can also be used to submit applications written inScala or Java and packaged as .jar files).We will now go through the parameters one-by-one so you have a goodhttps://www.iteblog.comhttps://github.com/apache/spark/blob/master/bin/spark-submitoverview of what you can do from the command line:--master: Parameter used to set the URL of the master (head) node.Allowed syntax is:local: Used for executing your code on your local machine. Ifyou pass local, Spark will then run in a single thread (withoutleveraging any parallelism). On a multi-core machine you canspecify either, the exact number of cores for Spark to use bystating local[n] where n is the number of cores to use, or runSpark spinning as many threads as there are cores on themachine using local[*].spark://host:port: It is a URL and a port for the Sparkstandalone cluster (that does not run any job scheduler such asMesos or Yarn).mesos://host:port: It is a URL and a port for the Spark clusterdeployed over Mesos.yarn: Used to submit jobs from a head node that runs Yarn asthe workload balancer.--deploy-mode: Parameter that allows you to decide whether tolaunch the Spark driver process locally (using client) or on one ofthe worker machines inside the cluster (using the cluster option).The default for this parameter is client. Here's an excerpt fromSpark's documentation that explains the differences with morespecificity (source: http://bit.ly/2hTtDVE):A common deployment strategy is to submit your applicationfrom [a screen session on] a gateway machine that isphysically co-located with your worker machines (e.g. Masternode in a standalone EC2 cluster). In this setup, client mode isappropriate. In client mode, the driver is launched directlywithin the spark-submit process which acts as a client to thecluster. The input and output of the application is attached tothe console. Thus, this mode is especially suitable forapplications that involve the REPL (e.g. Spark shell).Alternatively, if your application is submitted from a machinefar from the worker machines (e.g. locally on your laptop), ithttps://www.iteblog.comhttp://bit.ly/2hTtDVEis common to use cluster mode to minimize network latencybetween the drivers and the executors. 
Currently, standalonemode does not support cluster mode for Python applications.--name: Name of your application. Note that if you specified thename of your app programmatically when creating SparkSession(we will get to that in the next section) then the parameter from thecommand line will be overridden. We will explain the precedence ofparameters shortly when discussing the --conf parameter.--py-files: Comma-delimited list of .py, .egg or .zip files toinclude for Python apps. These files will be delivered to eachexecutor for use. Later in this chapter we will show you how topackage your code into a module.--files: Command gives a comma-delimited list of files that willalso be delivered to each executor to use.--conf: Parameterto change a configuration of your appdynamically from the command line. The syntax is =. For example, you can pass --conf spark.local.dir=/home/SparkTemp/ or --confspark.app.name=learningPySpark; the latter would be an equivalentof submitting the --name property as explained previously.NoteSpark uses the configuration parameters from three places: theparameters from the SparkConf you specify when creatingSparkContext within your app take the highest precedence, then anyparameter that you pass to the spark-submit script from thecommand line, and lastly, any parameter that is specified in theconf/spark-defaults.conf file.--properties-file: File with a configuration. It should have thesame set of properties as the conf/spark-defaults.conf file as itwill be read instead of it.--driver-memory: Parameter that specifies how much memory toallocate for the application on the driver. Allowed values have asyntax similar to the 1,000M, 2G. The default is 1,024M.--executor-memory: Parameter that specifies how much memory tohttps://www.iteblog.comallocate for the application on each of the executors. The default is1G.--help: Shows the help message and exits.--verbose: Prints additional debug information when running yourapp.--version: Prints the version of Spark.In a Spark standalone with cluster deploy mode only, or on a clusterdeployed over Yarn, you can use the --driver-cores that allowsspecifying the number of cores for the driver (default is 1). In a Sparkstandalone or Mesos with cluster deploy mode only you also have theopportunity to use either of these:--supervise: Parameter that, if specified, will restart the driver if itis lost or fails. This also can be set in Yarn by setting the --deploy-mode to cluster--kill: Will finish the process given its submission_id--status: If this command is specified, it will request the status ofthe specified appIn a Spark standalone and Mesos only (with the client deploy mode)you can also specify the --total-executor-cores, a parameter that willrequest the number of cores specified for all executors (not each). Onthe other hand, in a Spark standalone and YARN, only the --executor-cores parameter specifies the number of cores per executor (defaults to1 in YARN mode, or to all available cores on the worker in standalonemode).In addition, when submitting to a YARN cluster you can specify:--queue: This parameter specifies a queue on YARN to submit thejob to (default is default)--num-executors: Parameter that specifies how many executormachines to request for the job. 
If dynamic allocation is enabled, the initial number of executors will be at least the number specified.

Now that we have discussed all the parameters, it is time to put them into practice.

Deploying the app programmatically

Unlike the Jupyter notebooks, when you use the spark-submit command, you need to prepare the SparkSession yourself and configure it so your application runs properly.

In this section, we will learn how to create and configure the SparkSession, as well as how to use modules external to Spark.

Note: If you have not created your free account with either Databricks or Microsoft (or any other provider of Spark), do not worry - we will still be using your local machine, as this is easier to get us started. However, if you decide to take your application to the cloud, it will literally only require changing the --master parameter when you submit the job.

Configuring your SparkSession

The main difference between using Jupyter and submitting jobs programmatically is the fact that you have to create your Spark context (and Hive context, if you plan to use HiveQL) yourself, whereas when running Spark with Jupyter the contexts are automatically started for you.

In this section, we will develop a simple app that will use public data from Uber with trips made in the NYC area in June 2016; we downloaded the dataset from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-06.csv (beware, as it is an almost 3 GB file). The original dataset contains 11 million trips, but for our example we retrieved only 3.3 million and selected only a subset of all available columns.

Note: The transformed dataset can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/uber_data_nyc_2016-06_3m_partitioned.csv.zip. Download the file and unzip it to the Chapter11 folder from GitHub. The file might look strange, as it is actually a directory containing four files that, when read by Spark, will form one dataset.

So, let's get to it!

Creating SparkSession

Things with Spark 2.0 have become slightly simpler than with previous versions when it comes to creating SparkContext. In fact, instead of creating a SparkContext explicitly, Spark currently uses SparkSession to expose higher-level functionality. Here's how you do it:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('CalculatingGeoDistances') \
    .getOrCreate()

print('Session created')

The preceding code is all that you need!

Tip: If you want to use the RDD API you still can. However, you do not need to create a SparkContext anymore, as SparkSession starts one under the hood. To get access to it you can simply call (borrowing from the preceding example): sc = spark.sparkContext.

In this example, we first create the SparkSession object via its .builder internal class. The .appName(...) method allows us to give our application a name, and the .getOrCreate() method either creates a new SparkSession or retrieves one that already exists. It is a good convention to give your application a meaningful name, as it helps to (1) find your application on a cluster and (2) create less confusion for everyone.

Note: Under the hood, the Spark session creates a SparkContext object. When you call .stop() on the SparkSession, it actually terminates the SparkContext within.

Modularizing code

Building your code in such a way that it can be reused later is always a good thing.
Modularizing code

Building your code in such a way that it can be reused later is always a good thing. The same can be done with Spark - you can modularize your methods and then reuse them at a later point. It also aids the readability and maintainability of your code.

In this example, we will build a module that does some calculations on our dataset: It will compute the as-the-crow-flies distance (in miles) between the pickup and drop-off locations (using the Haversine formula), and will also convert the calculated distance from miles into kilometers.

Note
More on the Haversine formula can be found here: http://www.movable-type.co.uk/scripts/latlong.html.

So, first, we will build a module.

Structure of the module

We put the code for our external methods inside the additionalCode folder.

Tip
Check out the GitHub repository for this book if you have not done so already: https://github.com/drabastomek/learningPySpark/tree/master/Chapter11.

The tree for the folder looks as follows:

As you can see, it has the structure of a fairly normal Python package: At the top we have the setup.py file so we can package up our module, and then inside we have our code.

The setup.py file in our case looks as follows:

from setuptools import setup

setup(
    name='PySparkUtilities',
    version='0.1dev',
    packages=['utilities', 'utilities/converters'],
    license='''
        Creative Commons
        Attribution-Noncommercial-Share Alike license''',
    long_description='''
        An example of how to package code for PySpark'''
)

We will not delve into the details of the structure here (it is fairly self-explanatory on its own): You can read more about how to define setup.py files for other projects here: https://pythonhosted.org/an_example_pypi_project/setuptools.html.

The __init__.py file in the utilities folder has the following code:

from .geoCalc import geoCalc

__all__ = ['geoCalc', 'converters']

It effectively exposes geoCalc.py and converters (more on these shortly).

Calculating the distance between two points

The first method we mentioned uses the Haversine formula to calculate the direct distance between any two points on a map, given as latitude and longitude coordinates. The code that does this lives in the geoCalc.py file of the module.

The calculateDistance(...) is a static method of the geoCalc class. It takes two geo-points, expressed as either a tuple or a list with two elements (latitude and longitude, in that order), and uses the Haversine formula to calculate the distance. The Earth's radius necessary to calculate the distance is expressed in miles, so the distance calculated will also be in miles.

Converting distance units

We build the utilities package so it can be more universal. As a part of the package we expose methods to convert between various units of measurement.

Note
At this time we limit it to distance only, but the functionality can be further extended to other domains such as area, volume, or temperature.

For ease of use, any class implemented as a converter should expose the same interface. That is why it is advised that such a class derives from our BaseConverter class (see base.py):

from abc import ABCMeta, abstractmethod

class BaseConverter(metaclass=ABCMeta):
    @staticmethod
    @abstractmethod
    def convert(f, t):
        raise NotImplementedError

It is a purely abstract class that cannot be instantiated: Its sole purpose is to force the derived classes to implement the convert(...) method. See the distance.py file for details of the implementation.
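To give a rough idea of what such a converter can look like, here is a simplified sketch that follows the BaseConverter interface and matches the way we call it later in the chapter (metricImperial.convert('10 mile', 'km')). The conversion table is trimmed to just miles and kilometers, so the actual distance.py in the repository is more complete than this:

from .base import BaseConverter

class metricImperial(BaseConverter):
    # meters per unit; the real module supports more units than these two
    _UNITS = {'mile': 1609.344, 'km': 1000.0}

    @staticmethod
    def convert(f, t):
        # f is a string such as '10 mile', t is the target unit, e.g. 'km'
        value, unit = f.split()
        meters = float(value) * metricImperial._UNITS[unit]
        return meters / metricImperial._UNITS[t]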
The code should be self-explanatory for someone proficient in Python, so we will not go through it step by step here.

Building an egg

Now that we have all our code in place, we can package it. The documentation for PySpark states that you can pass .py files (using the --py-files switch) to the spark-submit script, separated by commas. However, it is much more convenient to package our module into a .zip or an .egg. This is when the setup.py file comes in handy - all you have to do is call this inside the additionalCode folder:

python setup.py bdist_egg

If all goes well, you should see three additional folders: PySparkUtilities.egg-info, build, and dist - we are interested in the file that sits in the dist folder: PySparkUtilities-0.1.dev0-py3.5.egg.

Tip
After running the preceding command, you might find that the name of your .egg file is slightly different, as you might have a different Python version. You can still use it in your Spark jobs, but you will have to adapt the spark-submit command to reflect the name of your .egg file.

User defined functions in Spark

In order to perform operations on DataFrames in PySpark you have two options: Use the built-in functions to work with the data (most of the time this will be sufficient to achieve what you need, and it is recommended as the resulting code is more performant) or create your own user-defined functions.

To define a UDF you have to wrap the Python function within the .udf(...) method and define its return value type. This is how we do it in our script (check the calculatingGeoDistance.py file):

import utilities.geoCalc as geo
from utilities.converters import metricImperial

getDistance = func.udf(
    lambda lat1, long1, lat2, long2:
        geo.calculateDistance(
            (lat1, long1),
            (lat2, long2)
        )
    )

convertMiles = func.udf(lambda m:
    metricImperial.convert(str(m) + ' mile', 'km'))

We can then use such functions to calculate the distance and convert it to kilometers:

uber = uber.withColumn(
    'miles',
    getDistance(
        func.col('pickup_latitude'),
        func.col('pickup_longitude'),
        func.col('dropoff_latitude'),
        func.col('dropoff_longitude')
    )
)

uber = uber.withColumn(
    'kilometers',
    convertMiles(func.col('miles')))

Using the .withColumn(...) method we create additional columns with the values of interest to us.

Note
A word of caution needs to be stated here. If you use the PySpark built-in functions, even though you call them as Python objects, underneath each call is translated and executed as Scala code. If, however, you write your own methods in Python, they are not translated into Scala; the data has to be shipped to Python worker processes and back, and the function itself runs as interpreted Python. This causes a significant performance hit. Check out this answer from Stack Overflow for more details: http://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python.
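If you wanted to stay entirely within the built-in functions for the distance itself, the Haversine formula can also be written as a single column expression, avoiding the Python UDF round trip altogether. The sketch below is only an illustration of that approach: it assumes the same Uber column names used above and an approximate Earth radius in miles, and is not the method used in calculatingGeoDistance.py:

import math
from pyspark.sql import functions as func

EARTH_RADIUS_MILES = 3959.0  # approximate

def haversineMiles(lat1, lon1, lat2, lon2):
    # Build one Column expression; no Python code runs per row
    def rad(name):
        return func.col(name) * math.pi / 180.0
    dlat = rad(lat2) - rad(lat1)
    dlon = rad(lon2) - rad(lon1)
    a = func.sin(dlat / 2) * func.sin(dlat / 2) + \
        func.cos(rad(lat1)) * func.cos(rad(lat2)) * \
        func.sin(dlon / 2) * func.sin(dlon / 2)
    return 2 * EARTH_RADIUS_MILES * func.asin(func.sqrt(a))

uber = uber.withColumn(
    'miles',
    haversineMiles('pickup_latitude', 'pickup_longitude',
                   'dropoff_latitude', 'dropoff_longitude'))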
Let's now put all the puzzles together and finally submit our job.

Submitting a job

In your CLI, type the following (we assume you keep the structure of the folders unchanged from how it is structured on GitHub):

./launch_spark_submit.sh \
    --master local[4] \
    --py-files additionalCode/dist/PySparkUtilities-0.1.dev0-py3.5.egg \
    calculatingGeoDistance.py

We owe you some explanation of the launch_spark_submit.sh shell script. In Bonus Chapter 1, Installing Spark, we configured our Spark instance to run Jupyter (by setting the PYSPARK_DRIVER_PYTHON system variable to jupyter). If you were to simply use spark-submit on a machine configured in such a way, you would most likely get some variation of the following error:

jupyter: 'calculatingGeoDistance.py' is not a Jupyter command

Thus, before running the spark-submit command we first have to unset the variable and then run the code. This would quickly become extremely tiring, so we automated it with the launch_spark_submit.sh script:

#!/bin/bash
unset PYSPARK_DRIVER_PYTHON
spark-submit "$@"
export PYSPARK_DRIVER_PYTHON=jupyter

As you can see, this is nothing more than a wrapper around the spark-submit command.

If all goes well, you will see the following stream of consciousness appearing in your CLI:

There is a host of useful things that you can get from reading the output:

The current version of Spark: 2.1.0
The Spark UI (which will be useful to track the progress of your job) started successfully on http://localhost:4040
Our .egg file was added successfully to the execution
The uber_data_nyc_2016-06_3m_partitioned.csv file was read successfully
Each start and stop of jobs and tasks is listed

Once the job finishes, you will see something similar to the following:

From the preceding screenshot, we can read that the distances are reported correctly. You can also see that the Spark UI process has now been stopped and all the cleanup jobs have been performed.

Monitoring execution

When you use the spark-submit command, Spark launches a local server that allows you to track the execution of the job. Here's what the window looks like:

At the top you can switch between the Jobs and Stages views; the Jobs view allows you to track the distinct jobs that are executed to complete the whole script, while the Stages view allows you to track all the stages that are executed.

You can also peek inside each stage's execution profile and track each task's execution by clicking on the link of the stage. In the following screenshot, you can see the execution profile for Stage 3 with four tasks running:

Tip
In a cluster setup, instead of driver/localhost you would see the driver number and the host's IP address.

Inside a job or a stage, you can click on DAG Visualization to see how your job or stage gets executed (the following chart on the left shows the Job view, while the one on the right shows the Stage view):
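If you prefer to track execution from code rather than in the browser, the same UI also serves a monitoring REST API under /api/v1 on the same port. A minimal sketch, assuming the job is running locally and the UI is on the default http://localhost:4040 address shown above:

import json
from urllib.request import urlopen

def get_json(url):
    # Fetch a Spark UI REST endpoint and parse the JSON response
    return json.loads(urlopen(url).read().decode('utf-8'))

base = 'http://localhost:4040/api/v1'

for app in get_json(base + '/applications'):
    for job in get_json('{0}/applications/{1}/jobs'.format(base, app['id'])):
        print(job['jobId'], job['status'], job['name'])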
Databricks Jobs

If you are using the Databricks product, an easy way to go from development in your Databricks notebooks to production is to use the Databricks Jobs feature. It will allow you to:

Schedule your Databricks notebook to run on an existing or new cluster
Schedule at your desired frequency (from minutes to months)
Schedule timeouts and retries for your job
Be alerted when the job starts, completes, and/or errors out
View historical job runs as well as review the history of the individual notebook job runs

This capability greatly simplifies the scheduling and production workflow of your job submissions. Note that you will need to upgrade your Databricks subscription (from the Community edition) to use this feature.

To use this feature, go to the Databricks Jobs menu and click on Create Job. From here, fill out the job name and then choose the notebook that you want to turn into a job, as shown in the following screenshot:

Once you have chosen your notebook, you can also choose whether to use an existing cluster that is running or have the job scheduler launch a New Cluster specifically for this job, as shown in the following screenshot:

Once you have chosen your notebook and cluster, you can set the schedule, alerts, timeout, and retries. Once you have completed setting up your job, it should look something similar to the Population vs. Price Linear Regression Job, as noted in the following screenshot:

You can test the job by clicking on the Run Now link under Active runs.

As noted in the Meetup Streaming RSVPs Job, you can view the history of your completed runs; as shown in the screenshot, for this notebook there are 50 completed job runs:

By clicking on the job run (in this case, Run 50), you can see the results of that job run. Not only can you view the start time, duration, and status, but also the results for that specific job:

Note
REST Job Server
A popular way to run jobs is also to use REST APIs. If you are using Databricks, you can run your jobs using the Databricks REST APIs. If you prefer to manage your own job server, a popular open source REST job server is spark-jobserver - a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. The project recently (at the time of writing) was updated so it can handle PySpark jobs. For more information, please refer to https://github.com/spark-jobserver/spark-jobserver.

Summary

In this chapter, we walked you through the steps on how to submit applications written in Python to Spark from the command line. The selection of the spark-submit parameters has been discussed. We also showed you how you can package your Python code and submit it alongside your PySpark script. Furthermore, we showed you how you can track the execution of your job.

In addition, we also provided a quick overview of how to run Databricks notebooks using the Databricks Jobs feature. This feature simplifies the transition from development to production, allowing you to take your notebook and execute it as an end-to-end workflow.

This brings us to the end of this book. We hope you enjoyed the journey, and that the material contained herein will help you start working with Spark using Python. Good luck!
RDDs support two types of operations: transformations, which produce a new RDD, and actions, which return a value to the driver after running a computation; we will cover these in greater detail in later chapters.

Note
For the latest list of transformations and actions, please refer to the Spark Programming Guide at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

RDD transformation operations are lazy in the sense that they do not compute their results immediately. The transformations are only computed when an action is executed and the results need to be returned to the driver. This delayed execution results in more fine-tuned queries: queries that are optimized for performance. This optimization starts with Apache Spark's DAGScheduler - the stage-oriented scheduler that transforms using stages, as seen in the preceding screenshot. By having separate RDD transformations and actions, the DAGScheduler can perform optimizations in the query, including being able to avoid shuffling the data (the most resource-intensive task).
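As a small illustration of this laziness (the file path is a placeholder and sc is assumed to be an already running SparkContext), no data is touched until the action in the last line is called:

# Transformations only describe the computation...
lines = sc.textFile('/path/to/some_file.txt')
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda word: len(word) > 5)

# ...an action such as .count() is what actually triggers the work
print(long_words.count())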
Wide Transformations section in High PerformanceSpark in Chapter 5, Effective Transformations(https://smile.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203).DataFramesDataFrames, like RDDs, are immutable collections of data distributedamong the nodes in a cluster. However, unlike RDDs, in DataFramesdata is organized into named columns.NoteIf you are familiar with Python's pandas or R data.frames, this is asimilar concept.DataFrames were designed to make large data sets processing eveneasier. They allow developers to formalize the structure of the data,allowing higher-level abstraction; in that sense DataFrames resembletables from the relational database world. DataFrames provide a domainspecific language API to manipulate the distributed data and make Sparkaccessible to a wider audience, beyond specialized data engineers.One of the major benefits of DataFrames is that the Spark enginehttps://www.iteblog.comhttps://smile.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203initially builds a logical execution plan and executes generated codebased on a physical plan determined by a cost optimizer. Unlike RDDsthat can be significantly slower on Python compared with Java or Scala,the introduction of DataFrames has brought performance parity acrossall the languages.DatasetsIntroduced in Spark 1.6, the goal of Spark Datasets is to provide an APIthat allows users to easily express transformations on domain objects,while also providing the performance and benefits of the robust SparkSQL execution engine. Unfortunately, at the time of writing this bookDatasets are only available in Scala or Java. When they are available inPySpark we will cover them in future editions.Catalyst OptimizerSpark SQL is one of the most technically involved components ofApache Spark as it powers both SQL queries and the DataFrame API.At the core of Spark SQL is the Catalyst Optimizer. The optimizer isbased on functional programming constructs and was designed with twopurposes in mind: To ease the addition of new optimization techniquesand features to Spark SQL and to allow external developers to extendthe optimizer (for example, adding data source specific rules, support fornew data types, and so on):https://www.iteblog.comNoteFor more information, check out Deep Dive into Spark SQL's CatalystOptimizer (http://bit.ly/271I7Dk) and Apache Spark DataFrames:Simple and Fast Analysis of Structured Data (http://bit.ly/29QbcOV)Project TungstenTungsten is the codename for an umbrella project of Apache Spark'sexecution engine. 
Project Tungsten

Tungsten is the codename for an umbrella project of Apache Spark's execution engine. The project focuses on improving the Spark algorithms so they use memory and CPU more efficiently, pushing the performance of modern hardware closer to its limits.

The efforts of this project focus, among others, on:

Managing memory explicitly, so the overhead of the JVM's object model and garbage collection is eliminated
Designing algorithms and data structures that exploit the memory hierarchy
Generating code at runtime, so applications can exploit modern compilers and optimize for CPUs
Eliminating virtual function dispatches, so that multiple CPU calls are reduced
Utilizing low-level programming (for example, loading immediate data into CPU registers) to speed up memory access, and optimizing Spark's engine to efficiently compile and execute simple loops

Note
For more information, please refer to:
Project Tungsten: Bringing Apache Spark Closer to Bare Metal (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal [SSE 2015 Video and Slides] (https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/)
Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop (https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html)

Spark 2.0 architecture

Apache Spark 2.0 is the most recent major release of the Apache Spark project, based on the key learnings from the last two years of development of the platform:

Source: Apache Spark 2.0: Faster, Easier, and Smarter http://bit.ly/2ap7qd5

The three overriding themes of the Apache Spark 2.0 release surround performance enhancements (via Tungsten Phase 2), the introduction of Structured Streaming, and unifying Datasets and DataFrames. We describe Datasets here because they are part of Spark 2.0, even though they are currently only available in Scala and Java.

Note
Refer to the following presentations by key Spark committers for more information about Apache Spark 2.0:
Reynold Xin's Apache Spark 2.0: Faster, Easier, and Smarter webinar http://bit.ly/2ap7qd5
Michael Armbrust's Structuring Spark: DataFrames, Datasets, and Streaming http://bit.ly/2ap7qd5
Tathagata Das' A Deep Dive into Spark Streaming http://bit.ly/2aHt1w0
Joseph Bradley's Apache Spark MLlib 2.0 Preview: Data Science and Production http://bit.ly/2aHrOVN

Unifying Datasets and DataFrames

In the previous section, we stated that Datasets (at the time of writing this book) are only available in Scala or Java. However, we are providing the following context to better understand the direction of Spark 2.0.

Datasets were introduced in 2015 as part of the Apache Spark 1.6 release. The goal for Datasets was to provide a type-safe programming interface. This allowed developers to work with semi-structured data (like JSON or key-value pairs) with compile-time type safety (that is, production applications can be checked for errors before they run).
Part of the reason why Python does not implement a Dataset API is that Python is not a type-safe language.

Just as important, the Datasets API contains high-level domain-specific language operations such as sum(), avg(), join(), and group(). This latter trait means that you have the flexibility of traditional Spark RDDs, but the code is also easier to express, read, and write. Similar to DataFrames, Datasets can take advantage of Spark's Catalyst Optimizer by exposing expressions and data fields to a query planner and making use of Tungsten's fast in-memory encoding.
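Although the typed Dataset API is not exposed in Python, the same high-level relational operations are available on PySpark DataFrames. The following toy example (the data and column names are made up purely for illustration) shows the style of code this enables:

from pyspark.sql import functions as func

flights = spark.createDataFrame(
    [('SEA', 'SFO', 10.0), ('SEA', 'JFK', 45.0), ('PDX', 'SFO', 0.0)],
    ['origin', 'destination', 'delay'])

airports = spark.createDataFrame(
    [('SEA', 'Seattle'), ('PDX', 'Portland')],
    ['code', 'city'])

(flights
    .join(airports, flights.origin == airports.code)
    .groupBy('city')
    .agg(func.avg('delay').alias('avg_delay'),
         func.sum('delay').alias('total_delay'))
    .show())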
The history of the Spark APIs is denoted in the following diagram, noting the progression from RDD to DataFrame to Dataset:

Source: From the webinar Apache Spark 1.5: What is the difference between a DataFrame and a RDD? http://bit.ly/29JPJSA

The unification of the DataFrame and Dataset APIs has the potential to create breaking changes to backwards compatibility. This was one of the main reasons Apache Spark 2.0 was a major release (as opposed to a 1.x minor release, which would have minimized any breaking changes). As you can see from the following diagram, DataFrame and Dataset both belong to the new Dataset API introduced as part of Apache Spark 2.0:

Source: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets http://bit.ly/2accSNA

As noted previously, the Dataset API provides a type-safe, object-oriented programming interface. Datasets can take advantage of the Catalyst Optimizer by exposing expressions and data fields to the query planner, and of Project Tungsten's fast in-memory encoding. But with DataFrame and Dataset now unified as part of Apache Spark 2.0, DataFrame is now an alias for the untyped Dataset API. More specifically:

DataFrame = Dataset[Row]

Introducing SparkSession

In the past, you would potentially work with SparkConf, SparkContext, SQLContext, and HiveContext to execute your various Spark queries for configuration, Spark context, SQL context, and Hive context respectively.
The SparkSession is essentially the combination of these contexts, including StreamingContext.

For example, instead of writing:

    df = sqlContext.read \
        .format('json').load('py/test/sql/people.json')

now you can write:

    df = spark.read.format('json').load('py/test/sql/people.json')

or:

    df = spark.read.json('py/test/sql/people.json')

The SparkSession is now the entry point for reading data, working with metadata, configuring the session, and managing the cluster resources.
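If you are working in your own environment rather than in an interactive shell that pre-creates the spark object for you, a minimal sketch of building a SparkSession looks like this (the application name is purely illustrative):

    from pyspark.sql import SparkSession

    # Create (or reuse) a session; the examples that follow assume this object exists.
    spark = SparkSession.builder \
        .appName('learning-pyspark') \
        .getOrCreate()

    # The underlying SparkContext remains available for the RDD examples in the next chapter.
    sc = spark.sparkContext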
Tungsten phase 2

The fundamental observation of the computer hardware landscape when the project started was that, while there were improvements in price per performance for RAM, disk, and (to an extent) network interfaces, the price per performance advancements for CPUs were not the same. Though hardware manufacturers could put more cores in each socket (that is, improve performance through parallelization), there were no significant improvements in actual core speed.

Project Tungsten was introduced in 2015 to make significant changes to the Spark engine with a focus on improving performance. The first phase of these improvements focused on the following facets:

- Memory management and binary processing: Leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
- Cache-aware computation: Algorithms and data structures to exploit the memory hierarchy
- Code generation: Using code generation to exploit modern compilers and CPUs

The following diagram is the updated Catalyst engine, denoting the inclusion of Datasets. As you can see at the right of the diagram (to the right of the Cost Model), Code Generation is used against the selected physical plans to generate the underlying RDDs:

Source: Structuring Spark: DataFrames, Datasets, and Streaming http://bit.ly/2cJ508x

As part of Tungsten Phase 2, there is a push toward whole-stage code generation. That is, the Spark engine now generates the bytecode at compile time for the entire Spark stage instead of just for specific jobs or tasks. The primary facets of these improvements include:

- No virtual function dispatches: This reduces multiple CPU calls, which can have a profound impact on performance when dispatching billions of times
- Intermediate data in memory versus CPU registers: Tungsten Phase 2 places intermediate data into CPU registers; obtaining data from CPU registers instead of from memory takes an order of magnitude fewer cycles
- Loop unrolling and SIMD: Optimizing Apache Spark's execution engine to take advantage of modern compilers' and CPUs' ability to efficiently compile and execute simple for loops (as opposed to complex function call graphs)

For a more in-depth review of Project Tungsten, please refer to:

- Apache Spark Key Terms, Explained: https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
- Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
- Project Tungsten: Bringing Apache Spark Closer to Bare Metal: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

Structured Streaming

As quoted by Reynold Xin during Spark Summit East 2016:

"The simplest way to perform streaming analytics is not having to reason about streaming."

This is the underlying foundation for building Structured Streaming. While streaming is powerful, one of the key issues is that streaming can be difficult to build and maintain. While companies such as Uber, Netflix, and Pinterest have Spark Streaming applications running in production, they also have dedicated teams to ensure the systems are highly available.

Note
For a high-level overview of Spark Streaming, please review Spark Streaming: What Is It and Who's Using It? http://bit.ly/1Qb10f6

As implied previously, there are many things that can go wrong when operating Spark Streaming (and any streaming system, for that matter), including (but not limited to) late events, partial outputs to the final data source, state recovery on failure, and/or distributed reads/writes:

Source: A Deep Dive into Structured Streaming http://bit.ly/2aHt1w0

Therefore, to simplify Spark Streaming, there is now a single API that addresses both batch and streaming within the Apache Spark 2.0 release. More succinctly, the high-level streaming API is now built on top of the Apache Spark SQL engine. It runs the same queries as you would with Datasets/DataFrames, providing you with all the performance and optimization benefits as well as features such as event time, windowing, sessions, sources, and sinks.
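Structured Streaming is covered in detail in Chapter 10, Structured Streaming; purely as a taste of how closely the streaming API mirrors the batch one, here is a minimal sketch of a running word count over a socket source (the host, port, and output mode are illustrative choices, not something this chapter depends on):

    from pyspark.sql.functions import explode, split

    # Read a stream of lines from a TCP socket; the result is an unbounded DataFrame.
    lines = spark.readStream \
        .format('socket') \
        .option('host', 'localhost') \
        .option('port', 9999) \
        .load()

    # The aggregation is expressed exactly as it would be on a static DataFrame.
    words = lines.select(explode(split(lines.value, ' ')).alias('word'))
    counts = words.groupBy('word').count()

    # Continuously print the updated counts to the console.
    query = counts.writeStream \
        .outputMode('complete') \
        .format('console') \
        .start()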
Continuous applications

Altogether, Apache Spark 2.0 not only unified DataFrames and Datasets but also unified streaming, interactive, and batch queries. This opens a whole new set of use cases, including the ability to aggregate data into a stream and then serve it using traditional JDBC/ODBC, to change queries at run time, and/or to build and apply ML models for many scenarios with a variety of latency requirements:

Source: Apache Spark Key Terms, Explained https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html

Together, you can now build end-to-end continuous applications, in which you can issue the same queries to batch processing as to real-time data, perform ETL, generate reports, and update or track specific data in the stream.

Note
For more information on continuous applications, please refer to Matei Zaharia's blog post Continuous Applications: Evolving Streaming in Apache Spark 2.0 - A foundation for end-to-end real-time applications http://bit.ly/2aJaSOr

Summary

In this chapter, we reviewed what Apache Spark is and provided a primer on Spark jobs and APIs. We also provided a primer on Resilient Distributed Datasets (RDDs), DataFrames, and Datasets; we will dive further into RDDs and DataFrames in subsequent chapters. We also discussed how DataFrames can provide faster query performance in Apache Spark thanks to the Spark SQL Engine's Catalyst Optimizer and Project Tungsten. Finally, we provided a high-level overview of the Spark 2.0 architecture, including Tungsten Phase 2, Structured Streaming, and the unification of DataFrames and Datasets.

In the next chapter, we will cover one of the fundamental data structures in Spark: the Resilient Distributed Dataset, or RDD. We will show you how to create and modify these schema-less data structures using transformations and actions so your journey with PySpark can begin.

Before we do that, however, please check the link http://www.tomdrabas.com/site/book for the Bonus Chapter 1, where we outline instructions on how to install Spark locally on your machine (unless you already have it installed). Here's a direct link to the manual: https://www.packtpub.com/sites/default/files/downloads/InstallingSpark.pdf

Chapter 2. Resilient Distributed Datasets

Resilient Distributed Datasets (RDDs) are a distributed collection of immutable JVM objects that allow you to perform calculations very quickly, and they are the backbone of Apache Spark.

As the name suggests, the dataset is distributed; it is split into chunks based on some key and distributed to executor nodes. Doing so allows for running calculations against such datasets very quickly. Also, as already mentioned in Chapter 1, Understanding Spark, RDDs keep track (log) of all the transformations applied to each chunk to speed up the computations and to provide a fallback if things go wrong and that portion of the data is lost; in such cases, RDDs can recompute the data. This data lineage is another line of defense against data loss, a complement to data replication.

The following topics are covered in this chapter:

- Internal workings of an RDD
- Creating RDDs
- Global versus local scopes
- Transformations
- Actions

Internal workings of an RDD

RDDs operate in parallel. This is the strongest advantage of working in Spark: each transformation is executed in parallel for an enormous increase in speed.

The transformations to the dataset are lazy.
This means that any transformation is only executed when an action on a dataset is called. This helps Spark to optimize the execution. For instance, consider the following very common steps that an analyst would normally perform to get familiar with a dataset:

1. Count the occurrences of distinct values in a certain column.
2. Select those that start with an A.
3. Print the results to the screen.

As simple as the previously mentioned steps sound, if only items that start with the letter A are of interest, there is no point in counting distinct values for all the other items. Thus, instead of following the execution as outlined in the preceding points, Spark could count only the items that start with A, and then print the results to the screen.

Let's break this example down in code. First, we tell Spark to map the values using the .map(lambda v: (v, 1)) method, and then select those records that start with an 'A' (using the .filter(lambda val: val.startswith('A')) method). If we call the .reduceByKey(operator.add) method, it will reduce the dataset and add (in this example, count) the number of occurrences of each key. All of these steps transform the dataset.

Second, we call the .collect() method to execute the steps. This step is an action on our dataset - it finally counts the distinct elements of the dataset. In effect, the action might reverse the order of transformations and filter the data first before mapping, resulting in a smaller dataset being passed to the reducer.

Note
Do not worry if you do not understand the previous commands yet - we will explain them in detail later in this chapter.
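Assembled into one snippet, the chain just described might look like the following sketch; the sample data is made up purely for illustration, and the filter tests the key of each (value, 1) pair so the code runs as written:

    import operator

    # Illustrative data only - any RDD of strings would do.
    names = sc.parallelize(['Amber', 'Alfred', 'Skye', 'Albert', 'Amber'])

    counts = names \
        .map(lambda v: (v, 1)) \
        .filter(lambda kv: kv[0].startswith('A')) \
        .reduceByKey(operator.add)

    # Nothing has been computed so far; .collect() is the action that triggers execution.
    print(counts.collect())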
Note, that this also applies to paths storedon Amazon S3 or Microsoft Azure Data Storage.Multiple data formats are supported: Text, parquet, JSON, Hive tables,and data from relational databases can be read using a JDBC driver.Note that Spark can automatically work with compressed datasets (likethe Gzipped one in our preceding example).Depending on how the data is read, the object holding it will berepresented slightly differently. The data read from a file is representedas MapPartitionsRDD instead of ParallelCollectionRDD when we.paralellize(...) a collection.SchemaRDDs are schema-less data structures (unlike DataFrames, which wewill discuss in the next chapter). Thus, parallelizing a dataset, such as inthe following code snippet, is perfectly fine with Spark when usingRDDs:data_heterogenous = sc.parallelize([ ('Ferrari', 'fast'), {'Porsche': 100000}, ['Spain','visited', 4504]]).collect()So, we can mix almost anything: a tuple, a dict, or a list and Sparkwill not complain.Once you .collect() the dataset (that is, run an action to bring it backto the driver) you can access the data in the object as you wouldnormally do in Python:data_heterogenous[1]['Porsche']https://www.iteblog.comIt will produce the following:100000The .collect() method returns all the elements of the RDD to thedriver where it is serialized as a list.NoteWe will talk more about the caveats of using .collect() later in thischapter.Reading from filesWhen you read from a text file, each row from the file forms an elementof an RDD.The data_from_file.take(1) command will produce the following(somewhat unreadable) output:To make it more readable, let's create a list of elements so each line isrepresented as a list of values.Lambda expressionsIn this example, we will extract the useful information from the crypticlooking record of data_from_file.NotePlease refer to our GitHub repository for this book for the details of thismethod. Here, due to space constraints, we will only present anhttps://www.iteblog.comabbreviated version of the full method, especially where we create theRegex pattern. The code can be found here:https://github.com/drabastomek/learningPySpark/tree/master/Chapter03/LearningPySpark_Chapter03.ipynbFirst,let's define the method with the help of the following code, whichwill parse the unreadable row into something that we can use:def extractInformation(row): import re import numpy as np selected_indices = [ 2,4,5,6,7,9,10,11,12,13,14,15,16,17,18, ... 77,78,79,81,82,83,84,85,87,89 ] record_split = re\ .compile( r'([\s]{19})([0-9]{1})([\s]{40}) ... ([\s]{33})([0-9\s]{3})([0-9\s]{1})([0-9\s]{1})') try: rs = np.array(record_split.split(row))[selected_indices] except: rs = np.array(['-99'] * len(selected_indices)) return rsTipA word of caution here is necessary. Defining pure Python methods canslow down your application as Spark needs to continuously switch backand forth between the Python interpreter and JVM. 
Schema

RDDs are schema-less data structures (unlike DataFrames, which we will discuss in the next chapter). Thus, parallelizing a dataset, such as in the following code snippet, is perfectly fine with Spark when using RDDs:

    data_heterogenous = sc.parallelize([
        ('Ferrari', 'fast'),
        {'Porsche': 100000},
        ['Spain', 'visited', 4504]
    ]).collect()

So, we can mix almost anything: a tuple, a dict, or a list, and Spark will not complain.

Once you .collect() the dataset (that is, run an action to bring it back to the driver), you can access the data in the object as you would normally do in Python:

    data_heterogenous[1]['Porsche']

It will produce the following:

    100000

The .collect() method returns all the elements of the RDD to the driver, where they are serialized as a list.

Note
We will talk more about the caveats of using .collect() later in this chapter.

Reading from files

When you read from a text file, each row from the file forms an element of an RDD.

The data_from_file.take(1) command will produce the following (somewhat unreadable) output:

To make it more readable, let's create a list of elements so each line is represented as a list of values.

Lambda expressions

In this example, we will extract the useful information from the cryptic-looking records of data_from_file.

Note
Please refer to our GitHub repository for this book for the details of this method. Here, due to space constraints, we will only present an abbreviated version of the full method, especially where we create the Regex pattern. The code can be found here: https://github.com/drabastomek/learningPySpark/tree/master/Chapter03/LearningPySpark_Chapter03.ipynb

First, let's define the method with the help of the following code, which will parse the unreadable row into something that we can use:

    def extractInformation(row):
        import re
        import numpy as np

        selected_indices = [
            2,4,5,6,7,9,10,11,12,13,14,15,16,17,18,
            ...
            77,78,79,81,82,83,84,85,87,89
        ]

        record_split = re\
            .compile(
                r'([\s]{19})([0-9]{1})([\s]{40})
                ...
                ([\s]{33})([0-9\s]{3})([0-9\s]{1})([0-9\s]{1})')
        try:
            rs = np.array(record_split.split(row))[selected_indices]
        except:
            rs = np.array(['-99'] * len(selected_indices))
        return rs

Tip
A word of caution here is necessary. Defining pure Python methods can slow down your application, as Spark needs to continuously switch back and forth between the Python interpreter and the JVM. Whenever you can, you should use built-in Spark functions.

Next, we import the necessary modules: the re module, as we will use regular expressions to parse the record, and NumPy for ease of selecting multiple elements at once.

Finally, we create a Regex object to extract the information as specified and parse the row through it.

Note
We will not be delving into the details of regular expressions here. A good compendium on the topic can be found here: https://www.packtpub.com/application-development/mastering-python-regular-expressions

Once the record is parsed, we try to convert the list into a NumPy array and return it; if this fails, we return a list of default values (-99) so we know this record did not parse properly.

Tip
We could implicitly filter out the malformed records by using .flatMap(...) and returning an empty list [] instead of the -99 values. Check this for details: http://stackoverflow.com/questions/34090624/remove-elements-from-spark-rdd

Now, we will use the extractInformation(...) method to split and convert our dataset. Note that we pass only the method itself (without calling it) to .map(...): in each partition, .map(...) will hand one element of the RDD at a time over to the extractInformation(...) method:

    data_from_file_conv = data_from_file.map(extractInformation)

Running data_from_file_conv.take(1) will produce the following result (abbreviated):

Global versus local scope

One of the things that you, as a prospective PySpark user, need to get used to is the inherent parallelism of Spark. Even if you are proficient in Python, executing scripts in PySpark requires shifting your thinking a bit.

Spark can be run in two modes: local and cluster. When you run Spark locally, your code might not differ much from what you are currently used to when running Python: changes would most likely be more syntactic than anything else, but with the added twist that data and code can be copied between separate worker processes.

However, taking the same code and deploying it to a cluster might cause a lot of head-scratching if you are not careful. This requires understanding how Spark executes a job on the cluster.

In cluster mode, when a job is submitted for execution, the job is sent to the driver (or master) node. The driver node creates a DAG (see Chapter 1, Understanding Spark) for the job and decides which executor (or worker) nodes will run specific tasks.

The driver then instructs the workers to execute their tasks and return the results to the driver when done. Before that happens, however, the driver prepares each task's closure: a set of variables and methods present on the driver for the worker to execute its task on the RDD.

This set of variables and methods is inherently static within the executors' context; that is, each executor gets a copy of the variables and methods from the driver. If, when running the task, the executor alters these variables or overwrites the methods, it does so without affecting either the other executors' copies or the variables and methods of the driver. This might lead to some unexpected behavior and runtime bugs that can sometimes be really hard to track down.

Note
Check out this discussion in PySpark's documentation for a more hands-on example: http://spark.apache.org/docs/latest/programming-guide.html#local-vs-cluster-modes
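To make the closure behavior concrete, here is a small sketch (the variable and the numbers are illustrative): each executor increments its own copy of counter, so the driver's value never changes; if you genuinely need a shared counter, Spark provides accumulators for that purpose.

    counter = 0

    def increment(x):
        global counter
        counter += 1      # updates the copy shipped to the executor, not the driver's variable
        return x

    sc.parallelize(range(10)).map(increment).count()
    print(counter)        # still 0 on the driver

    # An accumulator is the supported way to aggregate values back to the driver:
    acc = sc.accumulator(0)
    sc.parallelize(range(10)).foreach(lambda _: acc.add(1))
    print(acc.value)      # 10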
Transformations

Transformations shape your dataset. These include mapping, filtering, joining, and transcoding the values in your dataset. In this section, we will showcase some of the transformations available on RDDs.

Note
Due to space constraints, we include only the most often used transformations and actions here. For the full set of available methods, we suggest you check PySpark's documentation on RDDs: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

Since RDDs are schema-less, in this section we assume you know the schema of the produced dataset. If you cannot remember the positions of information in the parsed dataset, we suggest you refer to the definition of the extractInformation(...) method on GitHub, in the code for Chapter 03.

The .map(...) transformation

It can be argued that you will use the .map(...) transformation most often. The method is applied to each element of the RDD: in the case of the data_from_file_conv dataset, you can think of this as a transformation of each row.

In this example, we will create a new dataset that converts the year of death into a numeric value:

    data_2014 = data_from_file_conv.map(lambda row: int(row[16]))

Running data_2014.take(10) will yield the following result:

Note
If you are not familiar with lambda expressions, please refer to this resource: https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/

You can of course bring more columns over, but you would have to package them into a tuple, dict, or a list. Let's also include the 17th element of the row so that we can confirm our .map(...) works as intended:

    data_2014_2 = data_from_file_conv.map(
        lambda row: (row[16], int(row[16])))
    data_2014_2.take(5)

The preceding code will produce the following result:

The .filter(...) transformation

Another frequently used transformation is the .filter(...) method, which allows you to select elements from your dataset that fit specified criteria. As an example, from the data_from_file_conv dataset, let's count how many people died in an accident in 2014:

    data_filtered = data_from_file_conv.filter(
        lambda row: row[16] == '2014' and row[21] == '0')
    data_filtered.count()

Tip
Note that the preceding command might take a while depending on how fast your computer is. For us, it took a little over two minutes to return a result.
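The same predicate can also be written as a named function instead of a lambda, which some find easier to read and reuse as conditions grow; a small sketch (the function name is ours, and the column positions are the same as in the lambda above):

    def died_in_2014_accident(row):
        # same positions as in the lambda version above
        return row[16] == '2014' and row[21] == '0'

    data_filtered = data_from_file_conv.filter(died_in_2014_accident)
    data_filtered.count()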
The .flatMap(...) transformation

The .flatMap(...) method works similarly to .map(...), but it returns a flattened result instead of a list. If we execute the following code:

    data_2014_flat = data_from_file_conv.flatMap(lambda row: (row[16], int(row[16]) + 1))
    data_2014_flat.take(10)

It will yield the following output:

You can compare this result with the results of the command that generated data_2014_2 previously. Note, also, as mentioned earlier, that the .flatMap(...) method can be used to filter out some malformed records when you need to parse your input. Under the hood, the .flatMap(...) method treats each row as a list and then simply adds all the records together; by passing back an empty list, the malformed records are dropped.

The .distinct(...) transformation

This method returns a list of distinct values in a specified column. It is extremely useful if you want to get to know your dataset or validate it. Let's check whether the gender column contains only males and females; that would verify that we parsed the dataset properly. Let's run the following code:

    distinct_gender = data_from_file_conv.map(
        lambda row: row[5]).distinct()
    distinct_gender.collect()

This code will produce the following output:

First, we extract only the column that contains the gender. Next, we use the .distinct() method to select only the distinct values in the list. Lastly, we use the .collect() method to return and print the values to the screen.

Tip
Note that this is an expensive method and should be used sparingly and only when necessary, as it shuffles the data around.

The .sample(...) transformation

The .sample(...) method returns a randomized sample from the dataset. The first parameter specifies whether the sampling should be with replacement, the second parameter defines the fraction of the data to return, and the third is the seed for the pseudo-random number generator:

    fraction = 0.1
    data_sample = data_from_file_conv.sample(False, fraction, 666)

In this example, we selected a randomized sample of 10% from the original dataset. To confirm this, let's print the sizes of the datasets:

    print('Original dataset: {0}, sample: {1}'\
        .format(data_from_file_conv.count(), data_sample.count()))

The preceding command produces the following output:

We use the .count() action, which counts all the records in the corresponding RDDs.

The .leftOuterJoin(...) transformation

.leftOuterJoin(...), just like in the SQL world, joins two RDDs based on the keys found in both datasets, and returns records from the left RDD with records from the right one appended in places where the two RDDs match:

    rdd1 = sc.parallelize([('a', 1), ('b', 4), ('c', 10)])
    rdd2 = sc.parallelize([('a', 4), ('a', 1), ('b', '6'), ('d', 15)])
    rdd3 = rdd1.leftOuterJoin(rdd2)

Running .collect() on rdd3 will produce the following:

Tip
This is another expensive method and should be used sparingly and only when necessary, as it shuffles the data around, causing a performance hit.

What you can see here are all the elements from rdd1 and their corresponding values from rdd2. As you can see, the value 'a' shows up twice in rdd3 because 'a' appears twice in rdd2. The value 'b' from rdd1 shows up only once and is joined with the value '6' from rdd2. There are two things to note: the value 'c' from rdd1 does not have a corresponding key in rdd2, so the value in the returned tuple shows as None; and, since we were performing a left outer join, the value 'd' from rdd2 disappeared, as expected.

If we used the .join(...) method instead, we would have got only the values for 'a' and 'b', as these two keys appear in both RDDs. Run the following code:

    rdd4 = rdd1.join(rdd2)
    rdd4.collect()

It will result in the following output:

Another useful method is .intersection(...), which returns the records that are equal in both RDDs. Execute the following code:

    rdd5 = rdd1.intersection(rdd2)
    rdd5.collect()

The output is as follows:

    [('a', 1)]
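For completeness, the RDD API also provides .rightOuterJoin(...) and .fullOuterJoin(...); they are not used in this chapter, but a quick sketch with the same rdd1 and rdd2 shows how they relate to the left outer join:

    # Keeps every key from rdd2; keys absent from rdd1 get None on the left-hand side.
    rdd1.rightOuterJoin(rdd2).collect()

    # Keeps every key from both RDDs, filling the missing side with None.
    rdd1.fullOuterJoin(rdd2).collect()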
The .repartition(...) transformation

Repartitioning the dataset changes the number of partitions that the dataset is divided into. This functionality should be used sparingly and only when really necessary, as it shuffles the data around, which in effect results in a significant performance hit:

    rdd1 = rdd1.repartition(4)
    len(rdd1.glom().collect())

The preceding code prints out 4 as the new number of partitions.

The .glom() method, in contrast to .collect(), produces a list where each element is another list of all the elements of the dataset present in a specified partition; the main list returned has as many elements as the number of partitions.

Actions

Actions, in contrast to transformations, execute the scheduled task on the dataset; once you have finished transforming your data, calling an action executes the chain of transformations. That chain might contain no transformations at all (for example, .take(n) will just return n records from an RDD even if you did not apply any transformations to it), or it might execute the whole series.

The .take(...) method

This is arguably the most useful (and the most used, alongside the .map(...) method). The method is preferred over .collect(...), as it only returns the n top rows from a single data partition, in contrast to .collect(...), which returns the whole RDD. This is especially important when you deal with large datasets:

    data_first = data_from_file_conv.take(1)

If you want somewhat randomized records, you can use .takeSample(...) instead, which takes three arguments: the first specifies whether the sampling should be with replacement, the second the number of records to return, and the third the seed for the pseudo-random number generator:

    data_take_sampled = data_from_file_conv.takeSample(False, 1, 667)

The .collect(...) method

This method returns all the elements of the RDD to the driver. As we have just provided a caution about it, we will not repeat ourselves here.

The .reduce(...) method

The .reduce(...) method reduces the elements of an RDD using a specified method.

You can use it to sum the elements of your RDD:

    rdd1.map(lambda row: row[1]).reduce(lambda x, y: x + y)

This will produce the sum of 15.

We first create a list of all the values of rdd1 using the .map(...) transformation, and then use the .reduce(...) method to process the results. The reduce(...) method, on each partition, runs the summation method (here expressed as a lambda) and returns the sum to the driver node, where the final aggregation takes place.

Note
A word of caution is necessary here. The functions passed as a reducer need to be associative, that is, the result does not change when the elements are grouped differently, and commutative, that is, the result does not change when the order of the operands changes.

An example of the associativity rule is (5 + 2) + 3 = 5 + (2 + 3), and of commutativity, 5 + 2 + 3 = 3 + 2 + 5. Thus, you need to be careful about what functions you pass to the reducer.

If you ignore the preceding rule, you might run into trouble (assuming your code runs at all). For example, let's assume we have the following RDD (with one partition only!):

    data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 1)

If we were to reduce the data by dividing the current result by the subsequent one, we would expect a value of 10:

    works = data_reduce.reduce(lambda x, y: x / y)

However, if you were to partition the data into three partitions, the result will be wrong:

    data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 3)
    data_reduce.reduce(lambda x, y: x / y)

It will produce 0.004.
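An associative and commutative function, by contrast, returns the same result regardless of how the data is partitioned; a quick sketch using addition (operator.add) on the same data:

    import operator

    # Addition is both associative and commutative, so the partitioning does not matter.
    sc.parallelize([1, 2, .5, .1, 5, .2], 1).reduce(operator.add)   # ~8.8
    sc.parallelize([1, 2, .5, .1, 5, .2], 3).reduce(operator.add)   # ~8.8, same result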
The .reduceByKey(...) method works in a similar way to the .reduce(...) method, but it performs a reduction on a key-by-key basis:

    data_key = sc.parallelize(
        [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)],
        4)
    data_key.reduceByKey(lambda x, y: x + y).collect()

The preceding code produces the following:

The .count(...) method

The .count(...) method counts the number of elements in the RDD. Use the following code:

    data_reduce.count()

This code will produce 6, the exact number of elements in the data_reduce RDD.

The .count(...) method produces the same result as the following method, but it does not require moving the whole dataset to the driver:

    len(data_reduce.collect()) # WRONG -- DON'T DO THIS!

If your dataset is in a key-value form, you can use the .countByKey() method to get the counts of distinct keys. Run the following code:

    data_key.countByKey().items()

This code will produce the following output:

The .saveAsTextFile(...) method

As the name suggests, the .saveAsTextFile(...) method takes the RDD and saves it to text files: each partition is written to a separate file:

    data_key.saveAsTextFile(
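For illustration, a complete call could look like the following sketch; the output path is hypothetical, and reading the data back is included to show that everything comes back as strings that need to be parsed again:

    from ast import literal_eval

    # Hypothetical output path; Spark creates the directory and writes one part-* file per partition.
    data_key.saveAsTextFile('/tmp/data_key.txt')

    # Each element was written as its text representation, for example "('a', 4)",
    # so we parse the tuples back when re-reading the files.
    data_key_reread = sc.textFile('/tmp/data_key.txt').map(literal_eval)
    data_key_reread.collect()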