Answers | Spark 2 Workbook

```python
# 3️⃣ Keep only unique words
distinct_words = words.distinct()
```

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DeptSalary")
  .getOrCreate()
```

---

## 6. Quick Reference Cheatsheet (Spark 2.4)

```python
# 2️⃣ Split lines into words and clean them
words = lines.flatMap(lambda line: line.split()) \
             .map(lambda w: w.lower().strip('.,!?"\''))
```
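The split-and-clean lambdas can be sanity-checked without a cluster. A plain-Python sketch of the same logic (the sample line is invented for the demo):

```python
lines = ['Hello, world! Hello "Spark".']

# flatMap: one line -> many words
words = [w for line in lines for w in line.split()]

# map: normalise case and strip surrounding punctuation
cleaned = [w.lower().strip('.,!?"\'') for w in words]

print(cleaned)  # → ['hello', 'world', 'hello', 'spark']
```

Note that `strip` only removes punctuation from the ends of each token, which is exactly what the RDD version does.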

```python
# 1️⃣ Load the file as an RDD
lines = sc.textFile("hdfs:///data/input.txt")
```

If the workbook includes a **mini‑project** (e.g., “process a log dataset and produce a daily report”), you can outline the full pipeline: load the raw data, clean it, aggregate by day, and write out the report.
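One way to prototype that pipeline before porting it to Spark is plain Python; the log format used here (ISO date, space, message) is an assumption for the sketch, not something specified by the workbook:

```python
from collections import Counter

# Hypothetical log lines: "YYYY-MM-DD <message>" (assumed format)
logs = [
    "2019-03-01 user login",
    "2019-03-01 page view",
    "2019-03-02 user login",
]

# Daily report: events per day (this step maps to a reduceByKey in Spark)
daily_counts = Counter(line.split(" ", 1)[0] for line in logs)

for day in sorted(daily_counts):
    print(day, daily_counts[day])
```

Once the local version produces the expected report, each list comprehension translates mechanically into an RDD or DataFrame transformation.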

---

```python
import requests

# Shaped for rdd.mapPartitions: one HTTP session per partition,
# reused for every URL in the iterator
def fetch_batch(it):
    session = requests.Session()
    for url in it:
        yield session.get(url).text
    session.close()
```
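The point of `fetch_batch` is that the expensive `Session` is created once per partition rather than once per URL. The pattern can be exercised offline with a stand-in for `requests.Session` (`FakeSession` below is invented for the demo):

```python
from types import SimpleNamespace

class FakeSession:
    """Stand-in for requests.Session so the pattern runs offline."""
    def get(self, url):
        # Minimal response object exposing only .text
        return SimpleNamespace(text=f"body of {url}")
    def close(self):
        pass

def fetch_batch(it):
    # One session per partition, reused across every URL in the iterator
    session = FakeSession()
    for url in it:
        yield session.get(url).text
    session.close()

# Simulate one partition of URLs
print(list(fetch_batch(["http://a", "http://b"])))
# → ['body of http://a', 'body of http://b']
```

Because `fetch_batch` takes an iterator and yields results lazily, it plugs directly into `rdd.mapPartitions(fetch_batch)` without loading a whole partition into memory.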

## 8. Final Checklist Before Submitting

```python
# 4️⃣ Action – trigger the computation and collect the count
unique_word_count = distinct_words.count()
```
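Taken together, the four steps reduce to a small local computation. A plain-Python equivalent (the sample text is invented) is a quick way to predict what the RDD pipeline should return:

```python
lines = ["the quick brown fox.", "The quick, lazy dog!"]

# Steps 2 (flatMap + map): split, lowercase, strip punctuation
words = [w.lower().strip('.,!?"\'') for line in lines for w in line.split()]

# Steps 3 and 4 (distinct + count)
unique_word_count = len(set(words))

print(f"Unique words: {unique_word_count}")  # → Unique words: 6
```

Only step 4 is an action; steps 1–3 build a lineage of lazy transformations that Spark executes when `count()` is called.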

## 5. Tips for Maximising Marks

| Operation | PySpark | Scala |
|-----------|---------|-------|
| **Read CSV** | `spark.read.option("header","true").csv(path)` | `spark.read.option("header","true").csv(path)` |
| **Write Parquet** | `df.write.parquet("out.parquet")` | `df.write.parquet("out.parquet")` |
| **Cache** | `df.cache()` | `df.cache()` |
| **Repartition** | `df.repartition(10)` | `df.repartition(10)` |
| **Window** | `from pyspark.sql.window import Window` | `import org.apache.spark.sql.expressions.Window` |
| **UDF** | `spark.udf.register("toUpper", lambda s: s.upper(), StringType())` | `udf((s: String) => s.toUpperCase, StringType)` |
| **Streaming read** | `spark.readStream.format("socket")...` | `spark.readStream.format("socket")...` |
| **Stop Spark** | `spark.stop()` | `spark.stop()` |

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # entry point for the RDD API
```

```python
print(f"Unique words: {unique_word_count}")
```