Functional Programming and MapReduce
(Notes from a talk I gave in 2008 and later to the Austin Java User's Group.)
- What we will learn today:
- What is Functional Programming?
- Why it matters
- Some characteristics of Functional Programming
- MapReduce as an example of Functional Programming
- Definition of Functional Programming from our friends at Wikipedia:
In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the imperative programming style that emphasizes changes in state.
- Why does Functional Programming matter?
- Parallel programming
FP has seen a resurgence of interest due to the upcoming availability of chips with multiple cores. In the not too distant future our workstations will have 32 CPUs on them. Also, castoff old PCs or cheap new ones can be networked together to make a supercomputer. To use them efficiently will require a retooling of how we program. Functional Programming helps to code in a parallel manner.
- Code Modularity
FP also allows a higher level of modularity since functions can now be modules.
- Smaller program size
FP programs are typically smaller (by a factor of two to ten). Since they are smaller, they tend to be less error-prone.
- Parallel programming
- Some Characteristics of Functional Style
- Focus on "what" should be done, but not necessarily "how".
For example, in a spreadsheet we define the values of cells, but do not specify the order in which the cells are evaluated. Another example is SQL. We do not specify the order of JOINs, SELECTs, or WHEREs. The SQL engine does all that for us.
- Pure Functions
Pure functions use only the variables passed to them and have no side effects. Pure functions never interact with the outside world - no writing to files or communicating with the user. The function can be called over and over and will always return the same value.
This is like a mathematical function. It doesn't matter how many times you evaluate
Area = PI * (R**2)
for a given "R" the answer is always the same. The answer does not depend on what functions you called before this.
This gives us these benefits:- The order of function invocation does not matter.
- Functions can use lazy evaluation.
- Functions can be optimized.
- Refactoring and testing is easier.
- Functions lead to composability (which makes C# query operators more reusable).
Some imperative language compilers (eg, C++) allow pure functions to be marked as such for optimization.
Question: Are programs composed entirely of pure functions very useful?
- Lambda Expressions
Lambda expressions make code more readable by allowing anonymous inline methods.
Here's an example from emacs on defining a key, "Meta-Control-p" to mean insert "<p></p>" and move the cursor to inbetween the tags. We could have bound the key to a function with the same code, but it's just faster and clearer to use lambdas.
(define-key global-map "\M-\C-p" '(lambda () (interactive) (insert "<p></p>")(forward-char -4)))
- Immutable variables
An immutable variable has its value set once and can never change its value. Like a brick set in concrete on the left, it's not going anywhere. Normal variables are more like pebbles on the right that you can pick up and move around and change their "state".
Immutable variables makes parallel execution easier, but it's really weird to us imperative programmers.
XSLT transformations are an example of a language with immutable variables. Spreadsheets are another example of functional programming with immutable variables.
But how can we get rid of state? Here is the traditional imperative programming style to sum a collection of numbers:
int sum = 0; foreach(int i in new int[]{1,2,3}) { sum += i; } Console.Out.WriteLine("sum=" + sum); //sum=6
Using recursion in LISP we can accomplish the same thing:
(setq x '(1 2 3)) ;; "car" takes the first element of the list ;; "cdr" makes a list of all but the first element (defun add (numbers) "adds a list of numbers" (if (null numbers) 0 (+ (car numbers) (add (cdr numbers))) )) (add x) ;=>6
- "Higher-order" functions
- Definition:
Higher-order functions take functions as arguments and/or return functions as results. Functions are first class members of the program.
Here's a quick example in Ruby of a higher order function, "select". It returns all items matching a logical expression that is passed into "select" as an argument.
numbers = [1,2,3,4,5,6,7]; evens = numbers.select{|x| x % 2 == 0} puts evens #=> [2, 4, 6]
In early C# versions, the Array.Sort method needed an object with the IComparer interface. A function could not stand alone, it had to be embedded in an object in this Kingdom of Nouns.
string[] names = tmp.Split(' '); Array.Sort(names,new LengthComparer()); ... public class LengthComparer: IComparer { public int Compare(Object obj1, Object obj2) { ... } }
Why not create an anonymous function instead of creating a whole class which only has one function?
- A Common Higher Order Function - "Reduce" (aka fold, inject, accumulate)
Produces a single result from applying a function over a list of objects. (Kinda like those pills that reduce a 12oz steak, baked potato, chef salad, and cheesecake into a single pill. Shown here to the right - you may have to squint just a little to see steak).
In Ruby the reduce function is called "inject", since you are injecting an initial value. Note that the function we are passing to the inject method is sum+prime and product*prime.
primes = [1,3,5,7,11,13]; #using "inject" to sum. We pass in "0" as the initial value total = primes.inject(0){|sum,prime| sum+prime} puts total #=>40 puts primes.inject(1){|product,prime| product*prime} #=>15015
- Another Common Higher Order Function "Map"
From Wikipedia:
In many programming languages, map is the name of a higher-order function that applies a given function to a sequence of elements (such as a list) and returns a sequence of results.- "mapcar" in Lisp:
(setq x '(1 2 3)) (defun multiply_by_three (a) (* 3 a) ) (mapcar 'multiply_by_three x) ;produces: (3 6 9)
- "select" in C#:
using System; using System.Linq; using System.Linq.Expressions; using System.Collections.Generic; ... var values = new System.Collections.Generic.List<int>() { 1, 3, 5, 7 }; var foo = values.select(d => d * d);
- "map" in the proposed Java 7 BGGA proposal syntax
final Array<Integer> a = array(1, 2, 3); final Array<Integer> b = a.map({int i => i + 42}); arrayShow(intShow).println(b); // {43,44,45}
- "mapcar" in Lisp:
- Definition:
- Lazy Evaluation
In computer programming, lazy evaluation (or delayed evaluation) is the technique of delaying a computation until such time as the result of the computation is known to be needed.
For example, the unix "grep" command will normally find all occurrences of a string. But if it knows the "head" program will only print the first three, "grep" can stop after the first three and not find the remaining strings. This is lazy evaluation - only the data really needed is produced.
grep Battlestar movies.txt | head 3
- Other Characteristics and Topics
Not mentioned here are many aspects of FP that are worth your investigation: pattern matching, currying, closures, and monads.
- Focus on "what" should be done, but not necessarily "how".
- Example of a Functional Style in the Real World - MapReduce
Google's MapReduce framework is used to process 20 petabytes of data a day. MapReduce is a way of creating a massive supercomputer from vast farms of humble PCs. Although it is written in C++, the philosophy behind MapReduce is functional: "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -Jeffrey Dean and Sanjay Ghemawat
MapReduce was originally used to create an inverted index of the web.
A programmer specifies a "map" function
map (in_key, in_value) -> list(out_key, intermediate_value)
and a "reduce" function.
reduce (out_key, list(intermediate_value)) -> list(out_value)
- The MapReduce framework splits a large task into N chunks and spawns N "map" processes, typically on different machines, to process those chunks.
- The "map" function reads a set of key/value pairs, processes them in any way it deems fit, and produces intermediate (key, data) data structures.
- A splitter function, typically a hashing function, groups the (key, data) objects into M buckets and writes each bucket to a separate file. During the map phase N*M files are created.
- The framework periodically checks to make sure all the "map" functions are running. The framework can detect failures and reassign work if a process is thought to have gone rogue.
- After all the "map" functions have finished, the framework spawns M "reduce" processes to read the intermediate values. Each "reduce" process reads N bucket files, one from each of the "map" processes, and sorts them by the key. The "reduce" process then calls the user's reduce function with the key and a list of intermediate values. It then creates the final product.
Oversimplification of page rank algorithm:
map(string page_url, string pagecontents) -> (target_url1, page_url),(target_url2, page_url),... reduce(target_urlX, list(start_url)) -> (target_urlX, weight)
- Purity
"Purely functional" languages follow the rules. Examples of these are Haskell, Erlang, ML, and XSLT. Most commercial functional langauges are multi-paradigm, having some aspects of functional programming, but not pure. Examples of these are LISP, Ruby, C# 3.0 and F#.
- Value of the Functional Approach
From Joel on Software
Without understanding functional programming, you can't invent MapReduce, the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. The very fact that Google invented MapReduce, and Microsoft didn't, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world's largest massively parallel supercomputer. I don't think Microsoft completely understands just how far behind they are on that wave. - The ugly side of functional programming
- Memory
FP will often take more memory than imperative programming.
- Speed
Hand-coding in C can give faster code
- Learning Curve is steep
Not many programmers are trained in functional programming and it can take a while to grasp the concepts and use them well.
- Memory
- Review
- Why is functional programming making a comeback?
- Five characteristics of Functional Programming shown tonight:
What not how, Pure Functions, Lambda Expressions, Immutable variables, "Higher-order" functions, Lazy Evaluation
- Sources/Further Reading:
- functionaljava.org
- javac.info
- Why Functional Programming Matters, 1984. (html)
- Joel on software.
- Curiously, the best description of MapReduce I found was from one of its detractors here.