Data science, Python vs Scala
I recently got an opportunity to do some big data analysis. The work was to extract data from large files and build an analysis on top of it.
For this task, I was introduced to Spark on Databricks. We can use the following languages to write our logic in the notebook.
- Python
- R
- Scala
- SQL / Hive context
I knew Python already, and a little bit of Scala. Obviously I was confused about which would be better for the given task and for the long-term benefit.
I searched the internet to get some opinions on this. They were split between the two.
Initially, I was more inclined towards Python as it is getting more popular day by day. But after working for a month, I find Scala better for data science and data analysis, and these are my reasons for picking Scala.
- Scala is a statically typed language. Type safety is a huge win for productivity because you catch more bugs before running your code, and trust me, those bugs consume a lot of time before you know your code is right. This is the same reason TypeScript is becoming more and more popular.
- For the major analysis tasks like filtering and mapping, Scala seems better as it supports functional programming. You can use map, filter, reduce and get the benefit of using Scala data structures.
- Datasets: in Spark there are the following data abstractions:
1. RDD
2. DataFrames
3. DataSets
You can clearly see the benefit of using Dataset over DataFrame: Datasets are DataFrame + Type + Shape. To use the Dataset API you need a typed language like Scala or Java. I think Scala will be the choice here.
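To make the two points above a bit more concrete, here is a minimal sketch in plain Scala (no Spark needed to run it). The Sale case class and its values are made up for illustration. It runs on an ordinary List, but Spark's typed Dataset API offers map, filter and reduce in the same style.

```scala
// Hypothetical record type, invented for this example.
final case class Sale(region: String, amount: Double)

object SalesSummary {
  // A small in-memory list standing in for real data.
  val sales: List[Sale] = List(
    Sale("north", 120.0),
    Sale("south", 80.0),
    Sale("north", 40.5)
  )

  // filter keeps matching rows, map extracts one field,
  // reduce folds the remaining values into a single number.
  val northTotal: Double =
    sales
      .filter(sale => sale.region == "north")
      .map(sale => sale.amount)
      .reduce((a, b) => a + b)

  // A misspelled field such as sale.amuont is rejected by the compiler,
  // so the mistake is caught before anything runs.
  def main(args: Array[String]): Unit =
    println(northTotal)
}
```

Because every element is a Sale, the compiler knows the type of each field at every step of the chain.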
But even with the above reasons, Scala still seems a bit tough when you are getting started. This is the main reason people say Python is more natural and should be chosen. But after doing data analysis in Scala, I find you don't need all the weirdness of Scala. Use the best of it, like only the data structures and the functional aspects, and don't look into OOP. Trust me, you don't need it.
Scala has immutable data structures, which is a huge win for predictable results and a plus on Spark's distributed system.
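As a small sketch of what immutability looks like in practice (the names below are made up for illustration): adding to an immutable List or Map gives you a new collection and leaves the original alone.

```scala
object ImmutableDemo {
  val languages: List[String] = List("Python", "Scala", "R")

  // :+ builds a new list; the original list is left untouched.
  val withSql: List[String] = languages :+ "SQL"

  val scores: Map[String, Int] = Map("Scala" -> 1)

  // + returns a new Map that contains the extra entry.
  val moreScores: Map[String, Int] = scores + ("Python" -> 2)

  def main(args: Array[String]): Unit = {
    println(languages) // still three elements
    println(withSql)   // four elements
  }
}
```

Since nothing is changed in place, the same value can be read from many places without anyone stepping on it.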
For beginners, I have some points that will help you understand Scala code for data science / analysis.
- Forget your shorthand syntax for creating data structures, like [], {} etc. They don't work. Use a proper class like List, Map, Vector, Seq ...
- You can't get values using [index] or .index. In Scala, use the () syntax, e.g.
val listOfLanguages = List("Python", "Scala", "R", "SQL")
// to get the values at index 2
// You can't do listOfLanguages[2] or listOfLanguages.2
// right way is
val selectedLanguage = listOfLanguages(2) // Note: the same () syntax works for Map
- String: unlike some other languages, there is a clear difference in how to write String and Char.
val str = "A String should always be enclosed in \" double quotes, don't use single quotes"
For Char, always use ' (single quote).
- You remember we mentioned immutable data structures in Scala. There are two types of binding: one uses val and the other uses var. Please read up on the difference.
val: you can't change the reference.
var: you can reassign the value but type should be the same.
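The beginner points above can be put together in one short sketch. All names here are made up for illustration, and the commented-out lines are the ones the compiler would reject.

```scala
object BeginnerBasics {
  // No [] or {} literals: collections are built by calling List, Map, Vector...
  val counts: Map[String, Int] = Map("Python" -> 3, "Scala" -> 5)
  val scalaCount: Int = counts("Scala") // () looks up a Map value by key

  val versions: Vector[Int] = Vector(2, 3)
  val firstVersion: Int = versions(0)   // () reads an element by index

  val initial: Char = 'S'     // single quotes: exactly one Char
  val name: String = "Scala"  // double quotes: a String

  def reassignDemo(): String = {
    var current = "Python" // var: the binding can be reassigned
    current = "Scala"      // allowed, the new value is still a String
    // current = 42        // would not compile: type mismatch

    val fixed = "R"        // val: the reference cannot change
    // fixed = "SQL"       // would not compile: reassignment to val

    current + " and " + fixed
  }

  def main(args: Array[String]): Unit =
    println(reassignDemo())
}
```

In day-to-day analysis code I would reach for val by default and use var only when a value really has to change.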
Note: this is my first blog about data science and Scala. Let me know your feedback, this will help me to improve the quality of this article and my skills.