
I would say data analysis is actually very well suited to static types. In particular, static types let us be much more specific than string/int/float (relying on those alone is close to "stringly typed programming"). We can use more specific types like 'count', 'total', 'average', 'dollars', 'distance', 'index', etc. These are all numeric, but we want to avoid mixing them up (e.g. counts can be added together, but a count should not be added to a distance).
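
To make that concrete, here is a minimal Scala sketch of two such numeric types. The names 'Units', 'Count' and 'Distance', and the choice of operations, are my own illustration rather than anything from a particular library:

```scala
object Units {
  // Each wrapper is a distinct compile-time type over a plain number.
  final case class Count(value: Long) extends AnyVal {
    // Counts can be added to other counts.
    def +(other: Count): Count = Count(value + other.value)
  }

  final case class Distance(metres: Double) extends AnyVal {
    // Distances can be added to other distances.
    def +(other: Distance): Distance = Distance(metres + other.metres)
    // Dividing a total distance by a count gives an average distance.
    def /(n: Count): Distance = Distance(metres / n.value)
  }

  val trips: Count    = Count(3) + Count(2)
  val total: Distance = Distance(12.5) + Distance(7.5)
  val mean: Distance  = total / trips

  // val oops = trips + total  // rejected: a Count cannot be added to a Distance
}
```

Only the operations we chose to define are available, so the commented-out line is a type error rather than a silently wrong number.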

I also find 'phantom types' incredibly useful. For example, we might have a database or key/value store, where the keys are represented as strings and the values can be strings, integers, floats, booleans, etc. We can 'label' each key with a "phantom type" corresponding to the values. This way, the compiler knows exactly what types are expected, we only have to handle the expected type, we don't need any boilerplate to wrap/unwrap the database results, etc.:

    -- Haskell syntax (the @Type annotations need the TypeApplications extension)
    data DBKey t = DBKey String

    dbLookup :: DBKey t -> Maybe t
    dbLookup (DBKey key) = ...

    name = DBKey @String "name"
    age  = DBKey @Int    "age"

    // Scala syntax
    final case class DBKey[T](override val toString: String)

    def dbLookup[T](key: DBKey[T]): Try[T] = ...

    val name = DBKey[String]("name")
    val age  = DBKey[Int   ]("age" )

These are called "phantom types" since they don't actually affect anything about how the code runs; in particular, the definition of 'DBKey' requires a type ('t' or 'T') but doesn't actually use it for anything (it just stores a String). We could get rid of these 'labels' and the code would run identically; or we could have the compiler infer all these labels (defaulting to some uninformative type like 'unit' if it's ambiguous).

However, the fact that we've explicitly labelled 'name' with String, and 'age' with Int, constrains the rest of our code to treat them as such, otherwise we'll get a compiler error. For example, if we try to look up a name then our result handler must accept a String, and nothing else; if we provide a default age it must be an Int; etc.

Of course, we can combine such 'labels' with more meaningful types, so that 'name' looks up a 'Username'; 'age' looks up an 'Age', or a 'Nat', or a 'Duration'; or whatever's convenient. They can also be combined with other type system features like implicits, ad-hoc polymorphism or subtype polymorphism, e.g. to decode JSON as it's read from the DB.
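
As a rough sketch of combining a phantom-typed key with implicits, the following fills in 'dbLookup' against an in-memory Map. The 'Decoder' type class, the 'store' contents and the field name 'name' are all assumptions made up for this example; a real store and a real JSON decoder would replace them:

```scala
import scala.util.{Failure, Success, Try}

object PhantomKeys {
  // T is the phantom label: no field of DBKey has type T.
  final case class DBKey[T](name: String)

  // Hypothetical type class: how to turn a raw stored string into a T.
  trait Decoder[T] { def decode(raw: String): Try[T] }

  object Decoder {
    implicit val stringDecoder: Decoder[String] = new Decoder[String] {
      def decode(raw: String): Try[String] = Success(raw)
    }
    implicit val intDecoder: Decoder[Int] = new Decoder[Int] {
      def decode(raw: String): Try[Int] = Try(raw.toInt)
    }
  }

  // Stand-in for a real key/value store.
  val store: Map[String, String] = Map("name" -> "alice", "age" -> "42")

  // The key's label picks the decoder, so callers never unwrap by hand.
  def dbLookup[T](key: DBKey[T])(implicit decoder: Decoder[T]): Try[T] =
    store.get(key.name) match {
      case Some(raw) => decoder.decode(raw)
      case None      => Failure(new NoSuchElementException(key.name))
    }

  val name = DBKey[String]("name")
  val age  = DBKey[Int]("age")

  val nextAge: Try[Int] = dbLookup(age).map(_ + 1)

  // val wrong: Try[Int] = dbLookup(name)  // rejected: name is labelled String
}
```

The label on each key is what selects the decoder, which is why the call sites need no annotations of their own.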

I've used the example above in real Scala applications when accessing a key/value store. I've also used phantom types (again in Scala) to help me manipulate Array[Byte] values: in that case the bytes could either be 'Plaintext' or 'Encrypted', they could represent either a 'Payload' or a 'Key', and each 'Key' could be intended for either 'Encryption' or 'Signing'. Under the hood these were all just arrays of bytes, and the runtime would treat them as such (using Scala's 'AnyVal' to avoid actually wrapping values up), but these stronger types made things much clearer and safer.
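
A minimal sketch of how such labels might be defined is below. It is not the code from those applications: the trait names follow the description above, but the 'Tagged' object, the field name 'bytes', the 'encrypt' function and the toy XOR "cipher" are assumptions, there only so the example runs (it also skips the 'Try' wrapper for brevity):

```scala
object Tagged {
  // Phantom labels: these traits are never instantiated.
  sealed trait Plaintext
  sealed trait Encrypted
  sealed trait Payload
  sealed trait Encryption
  sealed trait Signing
  sealed trait Key[Purpose]

  // A value class: the labels exist only at compile time.
  final class Bytes[State, Content](val bytes: Array[Byte]) extends AnyVal

  // Toy XOR "cipher", NOT real cryptography; it just makes the sketch runnable.
  private def xor(key: Array[Byte], data: Array[Byte]): Array[Byte] =
    data.zipWithIndex.map { case (b, i) => (b ^ key(i % key.length)).toByte }

  def encrypt[T](
    key : Bytes[Plaintext, Key[Encryption]],
    data: Bytes[Plaintext, T]
  ): Bytes[Encrypted, T] = new Bytes[Encrypted, T](xor(key.bytes, data.bytes))

  def decrypt[T](
    key : Bytes[Plaintext, Key[Encryption]],
    data: Bytes[Encrypted, T]
  ): Bytes[Plaintext, T] = new Bytes[Plaintext, T](xor(key.bytes, data.bytes))

  val key = new Bytes[Plaintext, Key[Encryption]](Array[Byte](1, 2, 3))
  val msg = new Bytes[Plaintext, Payload]("hello".getBytes("UTF-8"))

  val roundTrip: Bytes[Plaintext, Payload] = decrypt(key, encrypt(key, msg))

  // decrypt(key, msg)  // rejected: msg is Plaintext, not Encrypted
}
```

Both functions do exactly the same thing to the underlying bytes; only the labels differ, and the labels are what stop us from decrypting something twice or using a payload as a key.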

For example, we can decrypt and verify a message which carries its own 'data keys' (encrypted with a 'master key'):

    def decrypt[T](
      key : Bytes[Plaintext, Key[Encryption]],
      data: Bytes[Encrypted, T],
    ): Try[Bytes[Plaintext, T]] = ...

    def validated[T](
      key : Bytes[Plaintext, Key[Signing]],
      data: Bytes[Plaintext, T],
    ): Try[Bytes[Plaintext, T]] = ...

    def decryptMessage[T](
      master : Bytes[Plaintext, Key[Encryption]],
      dataEnc: Bytes[Encrypted, Key[Encryption]],
      dataSig: Bytes[Encrypted, Key[Signing   ]],
      message: Bytes[Encrypted, T              ],
    ): Try[Bytes[Plaintext, T]] = for {
      encKey <- decrypt(master, dataEnc)
      sigKey <- decrypt(master, dataSig)
      raw    <- decrypt(encKey, message)      
      valid  <- validated(sigKey, raw)
    } yield valid

The compiler makes sure we never try to use an encrypted key, we never try to decrypt plaintext, we can't decrypt with a signing key or a payload, we can't validate signatures with an encryption key or a payload, our result type must match the message type, etc. After checking, these types are "erased" to give an implementation of type '(Array[Byte], Array[Byte], Array[Byte], Array[Byte]) => Try[Array[Byte]]', which we could have written directly, but I certainly wouldn't trust myself to do so!

