Apache Spark and Scala: Distributed Computing Framework for Big Data
Spark Architecture and RDD Model
Spark: distributed computing framework; runs on Hadoop YARN, Kubernetes, or standalone. RDD (Resilient Distributed Dataset): immutable, partitioned collection spread across cluster nodes. Example: rdd = sc.parallelize([1, 2, 3, 4, 5]) (create RDD from a local list). Operations: transformations (map, filter → new RDD) and actions (collect, count → result). Lazy evaluation: transformations don't execute until an action is called. Example: rdd2 = rdd.map(lambda x: x * 2) (no computation yet); rdd2.collect() triggers execution (map applied to all elements). Fault tolerance: each RDD tracks its lineage (parent RDD plus the transformation applied); if a node fails, lost partitions are recomputed from lineage (no data lost). Partitioning: an RDD is split across the cluster; rdd.repartition(4) redistributes it into 4 partitions. Narrow transformations: each output partition depends on a single parent partition (fast, no shuffle). Wide transformations: each output partition depends on multiple parent partitions (shuffle, slower). Example: map (narrow), groupByKey (wide, shuffle).
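The lazy-evaluation and lineage ideas above can be sketched in plain Python. This is a toy model, not the real RDD API: transformations only record themselves in a lineage list, and collect() replays the chain over the source data.

```python
# Toy sketch of Spark's lazy-evaluation model -- NOT the real RDD API.
# Transformations only record lineage; collect() runs the recorded chain.
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = data                  # source data (only at the root)
        self.lineage = lineage or []      # recorded transformations

    def map(self, f):
        # Narrow transformation: recorded, not executed.
        return ToyRDD(self.data, self.lineage + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self.data, self.lineage + [("filter", pred)])

    def collect(self):
        # Action: replay the lineage over the source data. Replaying the
        # same lineage is also how a lost partition would be recomputed.
        out = list(self.data)
        for op, f in self.lineage:
            if op == "map":
                out = [f(x) for x in out]
            else:  # filter
                out = [x for x in out if f(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4, 5])
rdd2 = rdd.map(lambda x: x * 2)                # no computation yet
print(rdd2.filter(lambda x: x > 4).collect())  # [6, 8, 10]
```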
DataFrame and SQL APIs
DataFrame: structured data (like Pandas, but distributed). df = spark.read.csv("data.csv", header=True). Schema inference: Spark auto-detects column types (inferSchema=True), or declare explicitly: schema = StructType([StructField("name", StringType()), StructField("age", IntegerType())]). SQL: df.createOrReplaceTempView("users"); spark.sql("SELECT * FROM users WHERE age > 18"). SQL performance: the Catalyst optimizer rewrites queries (similar to a database query planner). Example: SELECT name, COUNT(*) FROM users GROUP BY name → optimized execution plan. Distributed execution: the query runs across the cluster (each node processes a subset of the data). Shuffle: data movement between nodes (necessary for GROUP BY, JOIN). Cost: shuffles are expensive; minimize them by partitioning wisely. Caching: df.cache() keeps the DataFrame in memory (subsequent operations are fast). Broadcast joins: a small table is replicated to every node so the large table never shuffles. Example: employees.join(broadcast(departments), "dept_id") (broadcast the smaller table; "dept_id" here stands in for the join key).
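Why broadcasting the small side helps can be shown in a pure-Python sketch (an illustration of the idea, not Spark's implementation): each "partition" of the large table joins locally against a replicated hash map of the small table, so the large table never moves.

```python
# Pure-Python sketch of a broadcast hash join (illustrative only).
def broadcast_join(large_partitions, small_table, key):
    # "Broadcast": build the lookup once; in Spark this copy would be
    # shipped to every executor.
    lookup = {row[key]: row for row in small_table}
    joined = []
    for partition in large_partitions:     # no shuffle of the large side
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined

# Two partitions of a large table, one small lookup table.
employees = [[{"name": "Ada", "dept_id": 1}], [{"name": "Bo", "dept_id": 2}]]
departments = [{"dept_id": 1, "dept": "Eng"}, {"dept_id": 2, "dept": "Ops"}]
rows = broadcast_join(employees, departments, "dept_id")
print(rows)
```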
Scala Language Features
Scala: JVM language, functional + object-oriented. Type inference: val x = 5 (type inferred as Int). Collections: List, Map, Set — immutable by default. val list = List(1, 2, 3) (cannot be modified). Higher-order functions: map, filter, reduce. list.map(_ * 2).filter(_ > 2) (chained operations). Pattern matching: value match { case 1 => "one"; case 2 => "two"; case _ => "other" }. Destructuring: val (head, tail) = (1, List(2, 3)) (tuple destructuring). Implicit conversions: implicit def intToString(i: Int): String = i.toString (applied silently by the compiler; use sparingly). Case classes: immutable data holders with generated equals/hashCode/copy. case class User(name: String, age: Int). Companion object: holds factory methods. object User { def apply(name: String) = new User(name, 0) }. Traits: like interfaces, but may carry implementation. Actor model: lightweight concurrency (Akka framework). Lazy evaluation: lazy val and by-name parameters. Performance: Scala compiles to Java bytecode, so JVM (JIT) optimizations apply.
Spark ML and MLlib
MLlib: Spark's native machine learning library. Pipelines: DataFrame → transformer → transformer → ... → predictions. Example: StringIndexer (encode categorical) → OneHotEncoder (binary vector) → LogisticRegression (classify). Transformer: takes a DataFrame, outputs a new DataFrame. Estimator: learns from data (LogisticRegression.fit() trains a model and returns a Transformer). Stages: a pipeline has multiple stages (preprocessing, feature engineering, model). Training: model = pipeline.fit(df_train) (fits all stages in order). Predicting: predictions = model.transform(df_test) (applies the trained stages). Tuning: CrossValidator tests a parameter grid. Example: try regularization values [0.1, 1.0, 10.0], keep the best performer. Evaluation: BinaryClassificationEvaluator (AUC metric). Distributed training: Spark distributes gradient computation (e.g., for SGD) across nodes. Data parallelism: each node processes a batch of data; gradients are aggregated. Convergence: often within tens of iterations for linear models. Limitations: MLlib is not as advanced as Keras/PyTorch. Use case: traditional ML (classification, regression), not deep learning.
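The Transformer/Estimator split behind pipelines can be sketched in plain Python (illustrative names, not the MLlib API): an estimator learns parameters in fit() and returns a fitted transformer; a pipeline fits each stage in order, then chains their transform() calls.

```python
# Minimal sketch of the Transformer/Estimator pattern (pure Python).
class StringIndexer:                  # Estimator: learns a label -> index map
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        labels = sorted({r[self.column] for r in rows})
        return IndexerModel(self.column, {l: i for i, l in enumerate(labels)})

class IndexerModel:                   # Transformer produced by fit()
    def __init__(self, column, index):
        self.column, self.index = column, index

    def transform(self, rows):
        return [{**r, self.column + "_idx": self.index[r[self.column]]}
                for r in rows]

class Pipeline:                       # fits each stage, then chains transforms
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)   # feed output to the next stage
            fitted.append(model)
        return FittedPipeline(fitted)

class FittedPipeline:
    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows

train = [{"color": "red"}, {"color": "blue"}]
model = Pipeline([StringIndexer("color")]).fit(train)
print(model.transform([{"color": "red"}]))  # [{'color': 'red', 'color_idx': 1}]
```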
Streaming and Real-Time Processing
Spark Streaming: discretized streams (DStreams). Real-time data → micro-batches → one RDD per batch interval. Example: kafka_stream = KafkaUtils.createDirectStream(ssc, brokers, topics). Stateful operations: updateStateByKey (accumulate state across batches). Example: count events per user over time. Windowed operations: tumbling (non-overlapping) or sliding (overlapping) windows. Example (Structured Streaming, the newer API that largely supersedes DStreams): df.groupBy(window(col("timestamp"), "10 minutes"), col("user")).count() (events per user per 10-minute window). Latency: micro-batching adds latency (seconds typical). Use case: not suitable for sub-second latency. Alternative: Kafka Streams (lower latency, simpler setup). Integration: Kafka, HDFS, S3 sources. Output: print(), save to a database, send to external systems. Checkpointing: recovery on failure (resume from the last checkpoint). Graceful shutdown: stop() can be configured to finish the current batch before shutting down.
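The tumbling-vs-sliding distinction can be sketched over a list of timestamped events in plain Python (illustrative only; in Structured Streaming the window() function and groupBy do this): a tumbling window assigns each event to exactly one bucket, a sliding window can assign it to several.

```python
# Pure-Python sketch of tumbling vs. sliding window counts (illustrative).
def tumbling_counts(events, size):
    # Non-overlapping windows: each event lands in exactly one bucket.
    counts = {}
    for ts in events:
        start = (ts // size) * size
        counts[start] = counts.get(start, 0) + 1
    return counts

def sliding_counts(events, size, slide):
    # Overlapping windows: each window [start, start + size) begins at a
    # multiple of `slide`; an event can belong to several of them.
    counts = {}
    for ts in events:
        first = ((ts - size) // slide + 1) * slide
        start = max(first, 0)              # ignore windows starting before t=0
        while start <= ts:
            counts[start] = counts.get(start, 0) + 1
            start += slide
    return counts

events = [1, 3, 7, 8]              # event timestamps in seconds
print(tumbling_counts(events, 5))  # {0: 2, 5: 2}
print(sliding_counts(events, 5, 2))
```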
Performance Optimization and Tuning
Partitioning: num_partitions ≈ 2-4 × total CPU cores. Example: 16-core cluster → 32-64 partitions. Too few: underutilization (some cores idle). Too many: overhead (tiny tasks, scheduling and communication cost). Shuffle optimization: minimize data movement. Example: avoid groupByKey (shuffles all records); use reduceByKey (pre-aggregates within each partition first). Serialization: Kryo is faster than the default Java serialization. Config: spark.serializer=org.apache.spark.serializer.KryoSerializer. Memory tuning: spark.executor.memory (per-executor heap), spark.driver.memory (driver process). Garbage collection: G1GC recommended (low pause times). Monitoring: Spark UI (http://localhost:4040) shows job progress and executor metrics. Bottleneck detection: unusually long stages indicate data skew or slow operations. Caching strategy: cache intermediate RDDs/DataFrames to avoid recomputation. Broadcast joins: for lookup tables small enough to fit in each executor's memory. Example: broadcast(small_df) ships a copy to every node (up to a few hundred MB is typical). Columnar compression: Parquet format (often 50%+ smaller than CSV). Storage levels: MEMORY_ONLY (fast, limited), MEMORY_AND_DISK (spills to disk), DISK_ONLY (slow but unlimited).
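The reduceByKey-vs-groupByKey advice above comes down to shuffle volume, which a pure-Python sketch can make concrete (illustrative, not Spark internals): pre-aggregating inside each partition ("map-side combine") shrinks the number of records that must cross the network.

```python
# Pure-Python sketch: records shuffled by groupByKey vs. reduceByKey.
from collections import defaultdict

def shuffled_by_groupByKey(partitions):
    # groupByKey ships every (key, value) pair to the reducers.
    return sum(len(p) for p in partitions)

def shuffled_by_reduceByKey(partitions):
    # reduceByKey first combines values per key within each partition,
    # then ships only one record per key per partition.
    shipped = 0
    for p in partitions:
        combined = defaultdict(int)
        for key, value in p:
            combined[key] += value      # local pre-aggregation
        shipped += len(combined)
    return shipped

partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]
print(shuffled_by_groupByKey(partitions))   # 7 records cross the network
print(shuffled_by_reduceByKey(partitions))  # 4 records cross the network
```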
Deployment and Operations
Cluster managers: YARN (Hadoop), Kubernetes, Standalone, Mesos (deprecated in recent Spark releases). Master: local[*] for development on a single machine; a cluster manager for production. Deploy modes: client (driver runs where you submit) vs. cluster (driver runs on the cluster). Deployment: spark-submit --master yarn script.py. Driver: coordinates the job. Executors: worker processes that execute tasks. Example: the driver sends tasks to 4 executors, each processing one partition. Application logs: driver logs plus per-executor logs (helpful for debugging). Monitoring: Prometheus metrics integration. Metrics: CPU, memory, disk I/O. Alerting: trigger on high memory usage or job failures. Testing: unit tests (Scala/Python), integration tests (local Spark cluster). CI/CD: build the JAR, submit to the cluster, validate results. Data pipelines: schedule with Airflow or a Kubernetes CronJob. Example: a daily batch job reads data, trains a model, and saves predictions. Disaster recovery: checkpoints (enable resumption after failure). Cost optimization: spot instances (cheaper, preemptible). Scaling: horizontal (add nodes) or vertical (larger machines). Guideline: aim for roughly 80-90% cluster utilization (leave headroom).
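An illustrative spark-submit invocation tying these settings together (a deployment fragment; the resource numbers are placeholders, not recommendations):

```shell
# Submit a PySpark job to YARN in cluster mode (values are illustrative).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  script.py
```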