Logistic Regression Using PySpark in Python

In this era of Big Data, knowing only some machine learning algorithms wouldn’t do. One has to have hands-on experience in modeling but also has to deal with Big Data and utilize distributed systems. In this tutorial, we are going to have look at distributed systems using Apache Spark (PySpark).

What is Big Data and Distributed Systems?

Big data is a combination of structured, semistructured, and unstructured data in huge volume collected by organizations that can be mined for information and used in predictive modeling and other advanced analytics applications that help the organization to fetch helpful insights from consumer interaction and drive business decisions.

Big Data calls for Big Resources

It is not only difficult to maintain big data but also difficult to work with. Resources of a single system are not going to be enough to deal with such huge amounts of data (Gigabytes, Terabytes, and Petabytes) and hence we use resources of a lot of systems to deal with this kind of volume. Apache Spark lets us do that seamlessly taking in data from a cluster of storage resources and processing them into meaningful insights. I wouldn’t go deep into HDFS and Hadoop, feel free to use resources available online.

For demonstration purposes, we are going to use the infamous Titanic dataset. Although it is not in the category of Big Data, this will hopefully give you a starting point as to working with PySpark. Link to the dataset is given here

Firstly, we have to import Spark-SQL and create a spark session to load the CSV.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myproj').getOrCreate()
data = spark.read.csv('titanic.csv',inferSchema=True,header=True)

Now, let’s have a look at the schema of the dataset.

 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)

In case you want a list of columns:


Now we will select only the useful columns and drop rows with any missing value:

my_cols = data.select(['Survived',
my_final_data = my_cols.na.drop()

Data formatting and Categorical features

PySpark expects data in a certain format i.e in vectors. All features should be converted into a dense vector. Don’t worry PySpark comes with build-in functions for this purpose and thankfully it is really easy. But first, we have to deal with categorical data.

If you inspect the data carefully you will see that “Sex” and “Embarkment” are not numerical but categorical features. To convert them into numeric features we will use PySpark build-in functions from the feature class.

from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')
assembler = VectorAssembler(inputCols=['Pclass',


We will import and instantiate a Logistic Regression model.

from pyspark.ml.classification import LogisticRegression
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')

We will then do a random split in a 70:30 ratio:

train_titanic_data, test_titanic_data = my_final_data.randomSplit([0.7,.3])

Then we train the model on training data and use the model to predict unseen test data:

fit_model = log_reg_titanic.fit(train_titanic_data)
results = fit_model.transform(test_titanic_data)

Let’s evaluate our model, shall we?

from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
AUC = my_eval.evaluate(results)
print("AUC score is : ",AUC)
AUC score is : 0.7918269230769232

Again, using PySpark for this small dataset is surely an overkill but I hope it gave you an idea as to how things work in Spark. Sky is the limit for you now.

One response to “Logistic Regression Using PySpark in Python”

  1. Abdul says:

    Where does the assembler come in use? Once created, I’m not sure what it does. If I follow this code, I get an error saying “IllegalArgumentException: features does not exist” when I try train the model on the training data. Any help will be appreciated

Leave a Reply

Your email address will not be published. Required fields are marked *