👨🏾‍🏫 👐🏻 🧒🏿 Spark 3.0: nuevas funciones y ejemplos de su uso - parte 1 👩🏿‍🤝‍👨🏼 👌🏿 👩🏽‍🤝‍👩🏼

Para nuestro nuevo programa "Apache Spark para ingenieros de datos" y el seminario web sobre el curso del 2 de diciembre, hemos preparado una traducción de un artículo general sobre Spark 3.0.

Spark 3.0 presentó un montón de mejoras importantes, que incluyen: rendimiento mejorado con ADQ, lectura de binarios, compatibilidad mejorada con SQL y Python, Python 3.0, integración con Hadoop 3, compatibilidad con ACID.

En este artículo, el autor trató de dar ejemplos del uso de estas nuevas funciones. Este es el primer primer artículo sobre la funcionalidad de Spark 3.0 y se prevé que esta serie de artículos continúe.

Este artículo destaca las siguientes características de Spark 3.0:

Marco de ejecución de consultas adaptativas (AQE)
Soporte para nuevos idiomas
Nueva interfaz para transmisión estructurada
Leer archivos binarios
Navegación recursiva de carpetas
Compatibilidad con delimitadores de datos múltiples (||)
Nuevas funciones integradas de Spark
Cambiar al calendario gregoriano proléptico
Cola del marco de datos
Función de partición en consultas SQL
Compatibilidad mejorada con ANSI SQL

(AQE) – , , Spark 3.0. , , .

3.0 Spark , , Spark , . AQE , , , .

, (AQE) . spark.sql.adaptive.enabled true. AQE, Spark TPC-DS Spark 2.4

AQE Spark 3.0 3 :

,
join sort-merge broadcast

Spark 3.0 , :

Python3 (Python 2.x)
Scala 2.12
JDK 11

Hadoop 3 , Kafka 2.4.1 .

Spark Structured Streaming

web- Spark . , , , -, . , .

2 :

: Databricks

«Active Streaming Queries» , «Completed Streaming Queries» – .

Run ID : , , , , , . , Databricks.

Spark 3.0 “binaryFile”, .

binaryFile, DataFrameReader image, pdf, zip, gzip, tar . , .

val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")

df.printSchema()

df.show()

root

|-- path: string (nullable = true)

|-- modificationTime: timestamp (nullable = true)

|-- length: long (nullable = true)

|-- content: binary (nullable = true)

+--------------------+--------------------+------+--------------------+

+--------------------+--------------------+------+--------------------+

|file:/C:/tmp/bina…|2020-07-25 10:11:…| 74675|[89 50 4E 47 0D 0...|

+--------------------+--------------------+------+--------------------+

Spark 3.0 recursiveFileLookup, . true , DataFrameReader , .

spark.read.option("recursiveFileLookup", "true").csv("/path/to/folder")

Spark 3.0 (||) CSV . , CSV :

col1||col2||col3||col4

val1||val2||val3||val4

val df = spark.read

.option("delimiter","||")

.option("header","true")

.csv("/tmp/data/douplepipedata.csv")

Spark 2.x , . :

throws java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ||

Spark

Spark SQL, Spark .

sinh,cosh,tanh,asinh,acosh,atanh,any,bitand,bitor,bitcount,bitxor,

booland,boolor,countif,datepart,extract,forall,fromcsv,

makedate,makeinterval,maketimestamp,mapentries

mapfilter,mapzipwith,maxby,minby,schemaofcsv,tocsv

transformkeys,transform_values,typeof,version

xxhash64

Spark : 1582 , – .

JDK 7 java.sql.Date API. JDK 8 java.time.LocalDate API .

Spark 3.0 , Pandas, R Apache Arrow. 15 1582 ., Date&Timestamp, Spark 3.0, . , 15 1582 .

Spark 3.0 Date & Timestamp :

makedate(), maketimestamp(), makeinterval().

makedate(year, month, day) – <>, <> <>.

makedate(2014, 8, 13)

//returns 2014-08-13.

maketimestamp(year, month, day, hour, min, sec[, timezone]) – Timestamp <>, <>, <>, <>, <>, < >.

maketimestamp(2014, 8, 13, 1,10,40.147)

//returns Timestamp 2014-08-13 1:10:40.147

maketimestamp(2014, 8, 13, 1,10,40.147,CET)

makeinterval(years, months, weeks, days, hours, mins, secs) –

makedate() make_timestam() 0.

DataFrame.tail()

Spark head(), , tail(), Pandas Python. Spark 3.0 tail() . tail() scala.Array[T] Scala.

val data=spark.range(1,100).toDF("num").tail(5)

data.foreach(print)

//Returns

//[95][96][97][98][99]

repartition SQL

SQL Spark actions, Dataset/DataFrame, , Spark SQL repartition() . SQL-. .

val df=spark.range(1,10000).toDF("num")

println("Before re-partition :"+df.rdd.getNumPartitions)

df.createOrReplaceTempView("RANGE¨C17CTABLE")

println("After re-partition :"+df2.rdd.getNumPartitions)

//Returns

//Before re-partition :1

//After re-partition :20

ANSI SQL

Spark data-, ANSI SQL, Spark 3.0 . , true spark.sql.parser.ansi.enabled Spark .

Newprolab Apache Spark:

Apache Spark - (Scala). 11 , 5 .

Apache Spark (Python). " ". 6 , 5 .

Spark 3.0: nuevas funciones y ejemplos de su uso - parte 1

Spark Structured Streaming

Spark

DataFrame.tail()

repartition SQL

ANSI SQL

More articles: