Key differences between mainly used languages for data science

Posted on Sat 01 September 2018 in Coding • 6 min read

This article is number 1 in the "type systems" series:

  1. Key differences between mainly used languages for data science
  2. Static typing in Python
  3. Data classes in Python

Two notions of typing for programming languages are to be distinguished:

  • strong vs weak
  • static vs dynamic

In this blog post, the two notions are first explained, then illustrated with four programming languages widely used in data science pipelines: Python, Scala, JavaScript and C.

Strongly vs weakly typed

What exactly makes a language strongly or weakly typed is controversial. Here, I consider weakly typed languages to be those that may perform implicit type conversions at runtime, sometimes producing unpredictable results. Conversely, a strongly typed language refuses such conversions and reports an error instead. At the extreme, Assembly language is said to be untyped since there is no type checking at all. The absence of type checking allows a lot of freedom, which is necessary for aggressive optimizations (except for very specific projects, you don't write Assembly code yourself: you write in a higher-level language and the compiler produces Assembly code for you).

Static vs dynamic typing

If the type of a variable is known at compile time, the language is called statically typed. This makes it possible to catch trivial bugs at compile time, i.e. before execution.

Static typing does not imply that the programmer has to specify the type of each variable: some languages offer type inference. This is the case for Scala and OCaml, for example.

In dynamically typed languages, the type is associated with the value rather than with the variable, and it is only known at run-time.
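As a small illustration of the previous sentence, here is a minimal Python sketch. The helper type_name is a hypothetical name introduced only for this example; it simply reports the run-time type carried by a value.

```python
def type_name(value):
    """Return the name of the run-time type carried by `value`."""
    return type(value).__name__


x = 5
first = type_name(x)    # "int"

x = "5"                 # same name, now bound to a string
second = type_name(x)   # "str"

print(first, second)    # int str
```

The name x has no type of its own: it is the object currently bound to it that carries the type.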

Examples

To better understand the weaknesses of some languages, we are going to look at the four possible combinations:

  • weakly and dynamically typed
  • strongly and dynamically typed
  • strongly and statically typed
  • weakly and statically typed

JavaScript: weakly and dynamically typed

Let's define a small script, add_numbers.js:

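function add_numbers(a, b) {
    // Declare result locally so it does not leak into the global scope
    const result = a + b;
    return result;
}

const c = "5";  // a string
const n = 5;    // a number

const result = add_numbers(c, n);

console.log(result);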

We defined a function add_numbers which takes two arguments a and b and returns their sum. We apply it to a string c and a number n, and write the result to the standard output. We can run the above script with:

node add_numbers.js

It outputs 55. JavaScript silently converted the number n to a string, and concatenated the two strings. If you expected to see 10, swapping the two arguments does not help, since it produces the same result: "+" concatenates strings as soon as one of its operands is a string. This pitfall can be avoided by checking the types automatically beforehand (e.g. with TypeScript), as we will see in another blog post. It also makes me prefer strongly typed languages, as we will see with Python.
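To make the rule concrete, here is a rough Python model of how "+" behaved in the example above. js_plus is a hypothetical helper written for this post, not part of any library, and it only covers integers and strings; the real JavaScript operator handles many more cases.

```python
def js_plus(a, b):
    """Simplified model of JavaScript's `+`, for ints and strings only."""
    if isinstance(a, str) or isinstance(b, str):
        # One string operand is enough: both sides are turned into strings
        # and concatenated.
        return str(a) + str(b)
    # Otherwise this is an ordinary numeric addition.
    return a + b


print(js_plus("5", 5))   # 55
print(js_plus(5, "5"))   # 55
print(js_plus(5, 5))     # 10
```

The order of the operands does not matter: as long as one of them is a string, the result is a concatenation.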

Python: strongly and dynamically typed

Let's write the same script, but this time in Python (add_numbers.py):

def add_numbers(a, b):
    result = a + b
    return result

c = "5"
n = 5

result = add_numbers(c, n)
print(result)

The function add_numbers takes two arguments a and b, and returns their sum. We apply it, as previously in JavaScript, to a string c and an int n. When running the above script:

python3 add_numbers.py

It outputs:

TypeError: can only concatenate str (not "int") to str

This example shows that Python is strongly typed: it refuses to add a string and an int, and prevents an erroneous result by raising an exception.

However, this error is produced at run-time because Python is dynamically typed. This is not ideal, since the error only shows up when the faulty line is executed. If a long-running piece of code precedes your mistake (forgetting to convert the string to an int, for instance), you have to wait that long before noticing it, which slows down your development process.
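The sketch below illustrates this delay under simple assumptions: pipeline is a hypothetical function, and the call to time.sleep merely stands in for a long-running step. It also shows the explicit conversion that fixes the mistake.

```python
import time


def add_numbers(a, b):
    return a + b


def pipeline(c, n):
    time.sleep(0.1)            # stand-in for a long-running step
    return add_numbers(c, n)   # a type error only surfaces here


try:
    pipeline("5", 5)
except TypeError as error:
    # We only get here after the slow step has finished.
    print("failed at run-time:", error)

# Converting the string explicitly gives the intended result.
print(pipeline(int("5"), 5))   # 10
```

Nothing complains when pipeline is defined; the mistake is only reported once the faulty line actually runs.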

In another blog post, we will see how to use static typing in Python, but for now, let's see another language which is statically typed: Scala.

Scala: strongly and statically typed

Here we define the same function in Scala, this time specifying the type of the arguments, and apply it with a string and an integer:

def add_numbers(a: Int, b: Int) : Int = {
  val result = a + b
  result
}

val c = "5"
val n = 5

val result = add_numbers(c, n)
result

We can run this script (add_numbers.scala) with the scala command, which compiles it and executes the resulting program:

scala add_numbers.scala

It outputs:

add_numbers.scala:9: error: type mismatch;
 found   : String
 required: Int
   val result = add_numbers(c, n)
                            ^
one error found

The error is found at compile time. To make this explicit, we can separate the two steps, compile and run. For this, we have to wrap the script in an object to obtain a standalone Scala program:

object AddNumbers {
  def main(args: Array[String]): Unit = {
    val c = "5"
    val n = 5

    val result = add_numbers(c, n)
    println(result)
  }

  def add_numbers(a: Int, b: Int) :Int = {
    val result = a + b
    result
  }
}

When we ran the script directly, this wrapping in an object was done automatically by the scala command, the same tool that provides the Scala REPL for evaluating expressions interactively.

Here, we also show how to use type inference in Scala: the types of c and n are inferred to be String and Int respectively. The function add_numbers takes two arguments of type Int and returns a value of the same type.

We can compile the Scala code with:

scalac add_numbers.scala

which outputs:

add_numbers.scala:6: error: type mismatch;
 found   : String
 required: Int
    val result = add_numbers(c, n)
                             ^
one error found

Being able to detect such mistakes at compile time is extremely powerful. This is possible with languages that are both strongly and statically typed. If the language is weakly typed, we fall into one of the pitfalls of JavaScript: wrong results at run-time, as we will see with C.

C: weakly and statically typed

Here is again the same example, written in C:

#include <stdio.h>

int add_numbers(int a, int b) {
    int result = a + b;
    return result;
}

int main() {
    char c = "5";
    int n = 5;

    int result = add_numbers(c, n);
    printf("%d\n", result);

    return 0;
}

It can be compiled with:

gcc -w add_numbers.c -o add_numbers.out

Which produces the executable add_numbers.out.

When running the executable:

./add_numbers.out

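It outputs -73 on my machine; the exact value depends on where the string literal ends up in memory, because the char is initialized from a pointer rather than from the character '5'. This example shows the danger of using weakly typed languages.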
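Where such a number can come from is sketched below in Python. This is only a plausible reconstruction, assuming an 8-bit signed char and a compiler that keeps the low byte when converting a pointer to a char (the conversion is implementation-defined). The address is made up so that the arithmetic lands on -73, and to_signed_char is a hypothetical helper.

```python
def to_signed_char(value):
    """Keep the low 8 bits and read them as a signed byte."""
    low_byte = value & 0xFF
    return low_byte - 256 if low_byte >= 128 else low_byte


# Hypothetical address of the string literal "5"; the real one varies.
address = 0x100000FB2

c = to_signed_char(address)   # low byte 0xB2 = 178, read as -78
print(c + 5)                  # -73
```

The character '5' never enters the computation: only a fragment of the address does.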

In fact, when running the compiler, I disabled the warnings with the -w option. A good compiler with warnings enabled can detect such mistakes. For instance, compiling without the -w flag gives:

add_numbers.c:9:10: warning: incompatible pointer to integer conversion initializing 'char' with an expression
      of type 'char [2]' [-Wint-conversion]
    char c = "5";
         ^   ~~~
1 warning generated.

In more complicated cases (for instance when using void pointers), the compiler may not catch the error at all.

Conclusion

In this blog post, we have seen two notions describing how programming languages handle types: static vs dynamic typing, and strong vs weak typing.

Each case has been illustrated with one programming language:

  • Scala is both strongly and statically typed: this is useful for catching potentially erroneous code at an early stage. Scala is therefore often used in data engineering pipelines, where you want to spot as many errors as you can before going into production.
  • Python is strongly and dynamically typed: its strong type system catches erroneous code, but its dynamic nature means you only spot the errors at run-time. Python is often used by data scientists, and its dynamic nature is justified since you want to see the results immediately, avoiding an often slow compilation. The REPL lets you run code as you write it (and it is faster than with some statically typed languages that also provide a REPL, such as Scala).
  • C is statically but weakly typed: the weak type system allows some freedom for speed improvements and lets you handle memory at a low level. It is thus perfectly fine to use it when you know what you are doing, for tasks where the memory footprint and speed are important.
  • JavaScript is dynamically and weakly typed: this is sometimes convenient since, when you know how it works, the interpreter does the job for you by handling the conversions automatically.