Building a simple [Yahoo] S4 application

S4 is a distributed stream processing platform from Yahoo. It is often seen as the real-time counterpart of Hadoop. S4 being fault tolerant and horizontally scalable helps you in building very large stream processing application that can do anything from detecting earthquakes to finding that perfect bit of advertising that the visitor on your website is most likely to click.

At its core, an S4 application consists of a number of Processing Elements (PEs) that are wired together with the help of a spring configuration file that defines the PEs and the flow of events in the system. Also, events are produced by event producers that listen that sends these events to the client adapter for S4, from where, the S4 platform takes over and dispatch it to appropriate processing elements. After processing these events, PEs can choose to dispatch them to other PEs for further processing or they can choose to produce output events. Thus, arbitrarily complex behavior can be derived together by wiring a simple set of PEs.

S4 comes with a few example applications, but here is a much simpler S4WordCount application that shows how to:

  1. Keep state in a PE.
  2. Dispatch events from a PE.
  3. Process multiple events from a single PE.
  4. Write a simple java client for sending events to S4.

S4 is a distributed stream processing platform from Yahoo. It is often seen as the real-time counterpart of Hadoop. S4 being fault tolerant and horizontally scalable helps you in building very large stream processing application that can do anything from detecting earthquakes to finding that perfect bit of advertising that the visitor on your website is most likely to click.

At its core, an S4 application consists of a number of Processing Elements (PEs) that are wired together with the help of a spring configuration file that defines the PEs and the flow of events in the system. Also, events are produced by event producers that listen that sends these events to the client adapter for S4, from where, the S4 platform takes over and dispatch it to appropriate processing elements. After processing these events, PEs can choose to dispatch them to other PEs for further processing or they can choose to produce output events. Thus, arbitrarily complex behavior can be derived together by wiring a simple set of PEs.

S4 comes with a few example applications, but here is a much simpler S4WordCount application that shows how to:

  1. Keep state in a PE.
  2. Dispatch events from a PE.
  3. Process multiple events from a single PE.
  4. Write a simple java client for sending events to S4.

In S4WordCount, we will build a simple WordReceiverPE, that will receive events in the form of word and will simply print these words on stdout. It will also identify sentences in the word stream and then forward these sentences for further processing to SenteceReceiverPE. WordReceiverPE will also produce receive sentence events and print them to stdout.

First lets have a look at Word and Sentence, the event object used in our example:

package test.s4;

public class Word {
	private String string;

	public String getString() {
		return string;
	}

	public void setString(String message) {
		this.string = message;
	}

	@Override
	public String toString() {
		return "Word [string=" + string + "]";
	}
}

S4 uses keys, which is a set of some properties of the event object, for routing/dispatching events. In this case since Word is the key-less entry point into the system, and we don’t have any key for it, but for Sentence which will be processed further, we have a Sentence.sentenceId as the key. (For simplicity, all Sentence have the same sentenceId, 1)

Now let’s have a look at our first PE, i.e. WordReceiverPE:

We define a StringBuilder that will be used to accumulate words to form a Sentence. The processEvent(Word) method simply prints the received word on stdout and appends the word to the builder. It then checks if the sentence is complete by checking for . at the end of the builder and if so, it creates a Sentence event(object) and dispatches it to the Sentence stream. Once dispatched the processEvent(Sentence) will receive that event and will again print the sentence to stdout.

Now let’s have a look at SenteceReceiverPE, our second PE which does nothing but print the received Sentence on stdout.

Finally, lets see the content of the application config file that wires all this together and forms a valid S4 application. The name of the config file should follow the naming convention of <AppName>-config.xml. In this case we will call the config file S4WordCount-conf.xml

This is a spring bean definition file. The first bean, wordCatcher is an object of class test.s4.WordReceiverPE and it injects a dispatcher called dispatcher into WordRecieverPE. We will look at properties of dispatcher later. keys define the stream on which our PE will be listening for event. RawWords * means that it will be receiving all events on the stream named RawWords irrespective of their keys. Similar is the intention for Sentence *.

Our second bean is sentenceCatcher which is an object of class test.s4.SentenceReceiverPE and will accept all events on stream called Sentence.

Third is the definition for the dispatcher which we injected into wordCatcher. The dispatcher needs a partitioner that partitions the events based on some key and then dispatches them to appropriate PEs. In this case we are using the DefaultPartitioner whose properties are defined later by sentenceIdPartitoner bean which says that partition the event objects on Sentence stream by sentenceId property. dispatcher uses the S4 provided commLayerEmitter to emit the events.

Running the application

To run this application on the S4:

  1. Setup S4 as documented here.
  2. Create a S4WordCount.jar from above classes.
  3. Deploy the application on S4, by creating the following directory structure:
    	/$S4_IMAGE
    		/s4-apps
    			/S4WordCount
    				S4WordCount-conf.xml
    				/lib
    					S4Wordcount.jar
  4. Start S4: $S4_IMAGE/scripts/start-s4.sh -r client-adapter &
  5. Start client adapter: $S4_IMAGE/scripts/run-client-adapter.sh -s client-adapter -g s4 -d $S4_IMAGE/s4-core/conf/default/client-stub-conf.xml &

Now S4 is ready to receive events. Following is an event sender that uses the java client library to send events to S4. It reads one word at a time from stdin and sends it to S4.

Run client: java TestMessageSender localhost 2334 RawWords test.s4.Word

Here is the sample output on the S4 console:


Received: Word [string=this]
Received: Word [string=is]
Received: Word [string=a]
Received: Word [string=sentence.]

Using fast path!
Value 1, partition id 0
wrapper is stream:Sentence keys:[{sentenceId = 1}:0] keyNames:null event:Sentence [string= this is a sentence.] => 0
Received Sentence(WordReceiverPE) : Sentence [string= this is a sentence.]
Received Sentence: Sentence [string= this is a sentence.]

Curious case of Lewis Carrol, tournament runner ups and sorting

No less than four grand slams are played every year to find the best tennis player in the world. These are knockout tournaments in which, at each stage half of the players lose and get out. Two [supposedly] best players clash in the finals and the winner of that game is declared the champion (the best player) and the loser the runner up (the second best player). All is fine with our champion but there is just a small problem with our runner up: He might not be the second best player of the tournament.

How?

Let us assume that there were 16 players in the beginning with seedings 1 to 16, 16 being the best. Consider the following results for the matches in the tournament

15
	15
1
		15
5
	9
9
			16
12
	16
16
		16
4
	4
3
				16
11
	11
6
		11
7
	10
10
			14
2
	8
8
		14
13
	14
14

As expected, 16 is the winner but the runner up is 14 not 15. 15 crashed out in the semi finals playing against the champion.

So what can we conclude from this?

  1. Runner up may not be the second best player of the tournament.
  2. log n matches are required to decide the best player in the tournament.
  3. We can find the second best player in the tournament looking at the match results of the winner which is a list of log n players. At some point he must have defeated the second best player.

Here is a program that gives you the second best player in (n - 1) + ((log n) - 1) = n + (log n) - 2 comparisons:

Java Code Generators – A short rant

Java is known to be a verbose language and the situation worsens when you step into bloated enterprise java world. You need to write tons of code and configure a lot of JXXX to make your simple webapp work. Though the situation is improving in the recent years with the introduction of convention over configuration approach and also of annotation based configurations but still, if you compare the amount of code required for a functional webapp in Java then it would be at least 2X to 4X more than that of similar webapps written in other frameworks like Django or RoR.

A lot of java frameworks and IDEs tries to hide this complexity by generating code – from simple getters and setters to entire DAO layers and what not – that might give small productivity gains in the beginning but eventually every additional line of code in your project, either hand-written or generated by a state of art code generator, someone will need to maintain and evolve it, which according to some is almost 90% of the total software costs.

Thats why framework like Play feels like fresh air and it seems to be making inroads in the bloated enterprise java world.

My first Scheme program

I’ve just started experimenting with Scheme, specifically to follow Structure and Interpretation of Computer Programs. Here is a Scheme procedure that returns the sum of squares of two larger numbers out of the given three.

(define (sumLargeSquare a b c)
  (cond ((and (&lt; a b) (&lt; a c)) (+ (* b b) (* c c)))
        ((and (&lt; b a) (&lt; b c)) (+ (* a a) (* c c)))
        (else (+ (* a a) (* b b))))
)

'protected’ implementation in Actionscript 3 is BROKEN

Consider following code:

Theoretically, the call at line no. 13 should not cause any error as count is defined for testObj as a field in SubClass, but it gives a Cannot create property count on SubClass. Interestingly if we move modifyProperty method to SubClass or make count public, it works without any error.

So it can be said that we cannot access protected fields dynamically in a subclass.

How Facebook (probably) implements SMS threads

A lot of web applications provides you the ability to receive or post updates using SMS (text messages). Twitter is the posterboy of such applications.

But the feature that distinguishes Facebook SMS updates from any other service is that you can actually reply to individual messages and your message then becomes a part of the aforementioned thread. You don’t have to specify any special command to do so. Just hit the reply  button and you are done.

How do they know which message belongs to which thread? Its simple, each message comes from a different number. For example, say A writes something on your wall, then you will receive a message from +91922305501. Also B sends a message to you, a message comes from +91922305502 and when C comments on your note, you get a message from +91922305503. How many numbers do facebook own? Probably 100 in India, from +91922305500 to +9192230599.  A per user mapping of the number from which the message is sent and the object/thread id for that message is kept on the backend. So when the user replies to a particular number, say +91922305502 , it knows that is was a reply to the message sent by B and hence treats as a reply to that message. A rotation policy is used to decide from which number the message is to be sent. So after recieving a message from +9192230599, the next message will be from +91922305500  and now you cannot reply to the thread represented by the earlier message from the same number.

Discalimer: This is just a guess of how this functionality might have been implemented (or at least how I would implement it) and based strictly on SMSes received in India. Actual implementation might (probaly is) be entierly different from what is explained here.

Bing/Google toggle bookmarklet

I’ve been trying out Bing as my default search engine for last few months and the results are definitely at par with that of Google for most of the consumer categories. But sometimes you just need a quick toggle to see what Google would return for your query. So here is a small bookmarklet that I wrote that toggles from Bing to Google and vice versa.

Drag this to you favourites bar: Bing/Google Toogle

Note: Its tested on IE8 only. A few tweaks might be required for Firefox.

Minifying Javascript/css without changing file references in your source

Rule 10 of Steve Souders High Performance Web Sites: Minify Javascript

The most common problem faced while implemnting this is how you handle the full and minified version and how to change there reference in referencing documents. One of the easier ways to do this is to make it part of the deployment process.

Here are the relevent steps involved.

I’m using YUI Compressor.

#!/bin/bash
 
#Execute this script after checking out the latest source from repository.
 
#Minify all javascript files
cd /path/to/javascript
for x in `ls *.js`
do
        java -jar /path/to/compressor/yuicompressor-2.4.2.jar -o ${x%%.*}-min.js --preserve-semi  $x
done
 
#Minfiy all css files
cd /path/to/css
for x in `ls *.css`
do
        java -jar /path/to/compressor/yuicompressor-2.4.2.jar -o ${x%%.*}-min.css  $x
done

Now you don’t want to replace all references to x.css or x.js in your development code with references to x-min.css and x-min.js respectively. So what you can do is rewrite all those filenames at the web server level.

For apache the following rewrite rules work fine:

#enable rewriting
RewriteEngine on
RewriteRule /(.*)\.js /$1-min.js
RewriteRule /(.*)\.css /$1-min.css

Caution: Remember to delete existing minified css/js file before running the minifying script or you will end up with file names like x-min-min-min.js and so on. One way to do this is to clear the js/css folder before checking out files from your source repository.

Humanizing the time difference ( in django )

django.contrib.humanize is a set of Django template filters that adds human touch to data. It provides naturalday filter that formats date to ‘yesterday’, ‘today’ or ‘tomorrow’ when applicable.

A similar requirement which the humanize pacakge does not address is to display time difference with this human touch. so here is a snippet that does so.

Running Glassfish as a service on CentOS

Here is how you run glassfish as a service on CentOS:

  1. Create a user glassfish (you can call it anything you want) under which Glassfish will run.
    #useradd glassfish
  2. Install glassfish in /home/glassfish.
  3. Create the startup script /etc/init.d/glassfish for glassfish.
  4. Install the service
    #chmod +x /etc/init.d/glassfish
    #chkconfig --add glassfish
    #chkconfig --level 3 glassfish on
  5. Start glassfish.
    #/etc/init.d/glassfish start