Data science, Python vs Scala

Lokesh kumar Jain
May 15, 2019

I recently got an opportunity to do some big data analysis. The work was to extract data from large files and build an analysis on top of it.

For this task, I was introduced to Spark on Databricks. In a Databricks notebook, you can write your logic in any of the following languages.

  • Python
  • R
  • Scala
  • SQL / Hive context

I already knew Python, and a little bit of Scala. Obviously, I was confused about which would be better for the given task and for the long-term benefit.

I searched the internet for opinions on this. They were split between the two.

Initially, I was more inclined towards Python, as it is getting more popular day by day. But after working for a month, I find Scala is better for data science and data analysis, and these are my reasons for picking Scala.

  1. Scala is a statically typed language. Type safety is a huge win for productivity because you catch more bugs before running your code, and trust me, these jobs take a long time to run before you know whether you are right. This is the same reason TypeScript is becoming more and more popular.
  2. The major analysis tasks are filtering and mapping, and Scala suits them well because it supports functional programming. You can use map, filter, and reduce and get the benefit of Scala's data structures.
  3. Datasets. In Spark there are the following data abstractions:
    1. RDD
    2. DataFrames
    3. Datasets
    You can clearly see the benefit of using a Dataset over a DataFrame: a Dataset is a DataFrame plus types and shape. To use the Dataset API you need a typed language like Scala or Java. I think Scala is the choice here.
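
Here is a minimal sketch of points 1 and 2. It uses plain Scala collections rather than Spark, so it can be pasted into a notebook cell or the Scala REPL as is. The Sale case class and the sample values are hypothetical, made up only for illustration. Spark's typed Dataset API offers operations with the same names (map, filter, reduce), so the style should carry over.

```scala
// Hypothetical record type, used only for illustration.
final case class Sale(language: String, amount: Double)

val sales: List[Sale] = List(
  Sale("Scala", 120.0),
  Sale("Python", 80.0),
  Sale("Scala", 30.5),
  Sale("R", 45.0)
)

// filter keeps the matching rows; the compiler checks that `language` exists.
val scalaSales: List[Sale] = sales.filter(_.language == "Scala")

// map transforms each row; here we keep only the amount.
val scalaAmounts: List[Double] = scalaSales.map(_.amount)

// reduce combines the values pairwise into a single result.
val scalaTotal: Double = scalaAmounts.reduce(_ + _)

// A typo in a field name is caught before anything runs:
// sales.map(_.langauge)   // does not compile: langauge is not a member of Sale
```

The commented-out last line is the kind of mistake that a dynamically typed notebook would only report once the job is already running.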

Even with the above reasons, Scala still seems a bit tough to get started with. This is the main reason people say Python is more natural and should be chosen. But after doing data analysis in Scala, I find you don't need all the weirdness of Scala. Use only the best of it, the data structures and the functional side, and don't go into its OOP features. Trust me, you don't need them.

Scala has immutable data structures, which is a huge win for predictable results and a plus on Spark's distributed system.
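
To make the immutability point concrete, here is a small sketch with Scala's default List and Map. The names and values are made up for illustration. Every "change" returns a new collection and leaves the original as it was.

```scala
val base: List[Int] = List(1, 2, 3)

// "Adding" an element builds a new list; base is left untouched.
val extended: List[Int] = 0 :: base

// Transformations also return new collections.
val doubled: List[Int] = base.map(_ * 2)

// The default Map is immutable too: + returns a new Map.
val ages: Map[String, Int] = Map("Ann" -> 31)
val moreAges: Map[String, Int] = ages + ("Bob" -> 27)
```

After this runs, base still holds 1, 2, 3 and ages still has a single entry, so code that shares these values cannot be surprised by a change made elsewhere.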

For beginners, I have some points that will help you understand Scala code for data science and analysis.

  1. Forget the shorthand syntax for creating data structures, like [] and {}. It doesn't work. Use a proper class like List, Map, Vector, Seq ...
  2. You can't get values using [index] or dot notation. In Scala use the () syntax, e.g.
    val listOfLanguages = List("Python", "Scala", "R", "SQL")
    // to get the value at index 2
    // You can't do listOfLanguages[2] or listOfLanguages.2
    // right way is
    val selectedLanguage = listOfLanguages(2) // Note: the same () syntax reads a Map value by key
  3. String: unlike some other languages, there is a clear difference in how you write a String and a Char.
    val str = "A String should always be enclosed in \" double quotes; don't use single quotes"
    For a Char, always use ' single quotes.
  4. Remember we mentioned immutable data structures in Scala? There are two kinds of binding: one uses val and the other uses var. Please read up on the difference.
    val: you can't reassign the reference.
    var: you can reassign the value, but the type must stay the same.
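
The four points above can be tried together in one notebook cell or in the Scala REPL. This is only a sketch, and all names and values in it are hypothetical.

```scala
// Point 1: build collections with List, Map, Vector, Seq instead of [] or {}.
val languages: List[String] = List("Python", "Scala", "R", "SQL")
val rank: Map[String, Int] = Map("Python" -> 1, "Scala" -> 2)
val numbers: Vector[Int] = Vector(10, 20, 30)

// Point 2: () reads by index for List and Vector, and by key for Map.
val third: String = languages(2)
val lastNumber: Int = numbers(2)
val scalaRank: Int = rank("Scala")
// rank("Go") would throw NoSuchElementException; get returns an Option instead.
val missing: Option[Int] = rank.get("Go")

// Point 3: double quotes make a String, single quotes make a Char.
val word: String = "Scala"
val firstLetter: Char = word(0)

// Point 4: a val cannot be reassigned; a var can, but only to the same type.
var counter: Int = 0
counter = counter + 1
// word = "Python"   // does not compile: reassignment to val
// counter = "two"   // does not compile: type mismatch
```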

Note: this is my first blog post about data science and Scala. Let me know your feedback; it will help me improve the quality of this article and my skills.

Lokesh kumar Jain

Loves web technologies. Works as a frontend team lead and enjoys the JavaScript tech stack, such as Reactjs, Angular, and HTML5.