Stage-Level Job Pipelining in Apache Spark: A Deep Dive

How we improved TracksETL throughput by understanding Spark’s execution model

Introduction

Back in 2018, we migrated TracksETL (one of our largest ETL jobs) from Hive to Spark to handle our exploding event volumes. At that time, we processed around 2.28 billion events monthly. Fast forward to today, and we’re processing over 15 billion events, more than a 6x increase. Throughout this growth, our Spark-based ETL has continued to perform well, largely thanks to an optimization we implemented: stage-level job pipelining.

In this post, I’ll walk through the evolution of our job scheduling strategy, from naive sequential processing to a sophisticated stage-aware pipelining approach that maximizes cluster utilization.

Understanding the Problem

TracksETL processes event data by day partitions. Due to data volume, we can’t process all days simultaneously — each day becomes a separate Spark job within a single application. A typical job has three distinct stages:

  1. READ – Read from HDFS (etl_events, prod_useraliases, etc.). This is very fast as it’s all sequential reads and parallelism is limited to the number of files on HDFS.
  2. PROCESS – Process and deduplicate data. This is computationally expensive and entails shuffling large amounts of data to join and de-duplicate.
  3. WRITE – Write back to HDFS. Since we don’t want to write data into a large number of small files, output is coalesced and then written. This is the slowest stage and is I/O bound.

The WRITE stage is particularly problematic: it’s I/O bound and uses fewer tasks (~120) than our available cores (~300), leaving resources idle.
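To make the stage structure concrete, a single day’s job looks roughly like the sketch below. The paths, column names and the deduplication rule are illustrative placeholders; the real job joins against several tables and applies more involved dedup logic.

import org.apache.spark.sql.SparkSession

// Rough sketch of one day's job; paths, columns and the dedup rule are illustrative only
def processDay(spark: SparkSession, day: String): Unit = {
  // READ: sequential scans from HDFS; parallelism is bounded by the number of files
  val events  = spark.read.parquet(s"/data/etl_events/day=$day")
  val aliases = spark.read.parquet("/data/prod_useraliases")

  // PROCESS: shuffle-heavy join and deduplication
  val deduped = events
    .join(aliases, Seq("user_id"))
    .dropDuplicates("event_id")

  // WRITE: coalesce to avoid many small files, then write back to HDFS (slow, I/O bound)
  deduped
    .coalesce(120)
    .write
    .mode("overwrite")
    .parquet(s"/data/tracks/day=$day")
}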

The Evolution of Our Scheduling Strategy

Iteration Zero: Sequential Processing

The simplest approach — process one day at a time:

Result: Works with minimal resources, but painfully slow. The timeline was pretty boring and it never made it to production.

Iteration One: Parallel Jobs with par

As each stage was already individually optimized, it was not possible to get this running faster without running multiple jobs in parallel and throwing more hardware at it. This can be done by triggering Spark Actions for independent jobs in parallel. In Scala, this would look like:

jobs.par.foreach(job => job.compute)

Here, jobs is a list of jobs to be computed and compute triggers one of the Spark actions. par converts the list into a parallel collection, so the jobs are submitted to Spark concurrently. By default, the parallelism of a parallel collection matches the number of available cores on the driver, which in our case meant four jobs running in parallel.

This is a reasonably straightforward way of running parallel jobs without introducing any code complexity. In fact, for most cases this should be enough.
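If the default parallelism doesn’t match how many jobs you want in flight, it can be tuned by attaching a custom task support to the parallel collection. A minimal sketch, with three concurrent jobs and the same jobs list and compute method as above:

import java.util.concurrent.ForkJoinPool

import scala.collection.parallel.ForkJoinTaskSupport

// Limit the parallel collection to 3 concurrent jobs instead of the default (one per core).
// Note: on Scala 2.11, use scala.concurrent.forkjoin.ForkJoinPool instead of the java.util.concurrent one.
val parallelJobs = jobs.par
parallelJobs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(3))
parallelJobs.foreach(job => job.compute)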

Problem: Jobs run in batches. The entire batch must complete before the next starts. In the diagram above, you can see that Jobs 2 and 3 finished before Job 1, but no new job is submitted until all jobs in the first batch are finished. This leads to some executors sitting idle when only a few tasks from one of the WRITE stages remain.

Iteration Two: Job-Level Pipelining

The next thing to try was to start a new job as soon as one of the running jobs finished:

I never implemented this approach, as there was an even better one.
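For completeness, here is roughly what it would have looked like: a fixed pool of workers pulling days from a shared queue, so a new job starts the moment a worker frees up. This is only an illustrative sketch (the jobs list and compute method are as in the par snippet above), not code we shipped:

import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

import scala.collection.JavaConverters._
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Three workers; each starts the next job as soon as its current one finishes
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(3))

val queue = new ConcurrentLinkedQueue(jobs.asJava)

val workers = (1 to 3).map { _ =>
  Future {
    var job = queue.poll()
    while (job != null) {
      job.compute // blocks until this day's Spark action completes
      job = queue.poll()
    }
  }
}

Await.result(Future.sequence(workers), Duration.Inf)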

Iteration Three: Stage-Level Pipelining (The Final Solution)

While describing the jobs earlier, I mentioned that the first two stages were fast and already optimized, while the third stage was I/O bound and slow. A WRITE stage is usually broken into around 120 tasks, while we have around 300 cores dedicated to the TracksETL application.

Also, since there is considerable variation in the volume of events we receive every day, the WRITE stages of the parallel jobs often start at different times. Towards the end of each batch, we end up in a state where only WRITE stages are running, which is painfully slow.

To alleviate this issue, we can start the next job as soon as one of the already running jobs starts their WRITE stage. This will slow down the existing job a bit but not considerably, while we gain from starting the new job much earlier by not waiting for the whole job to finish. The wall-clock time for all the WRITE stages is amortized over the execution of READ and PROCESS stages.

The key insight: don’t wait for jobs to finish — start new jobs when running jobs enter their WRITE stage. Since WRITE is I/O bound and uses fewer cores, we can overlap it with the CPU-intensive READ and PROCESS stages of new jobs.
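To put hypothetical numbers on the amortization: suppose READ + PROCESS takes 40 minutes and WRITE takes 20 minutes per day. Processing three days strictly sequentially costs 3 × 60 = 180 minutes. If each new job starts as soon as the previous one enters WRITE, the WRITE of day N overlaps with the READ/PROCESS of day N+1, and the three days finish in roughly 40 + 40 + 40 + 20 = 140 minutes. The saving grows with the number of days, since only the last WRITE sits on the critical path.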

Implementation Deep Dive

The implementation uses Spark’s StatusTracker API to monitor job and stage progress. Here’s how it works:

Architecture Overview

The diagram below shows the overall architecture of our stage-level pipelining implementation:

The system consists of three main components:

  1. BlockingQueue – Holds all partition groups (days) to be processed. Each day’s data becomes a work item in the queue.
  2. Scheduling Loop – The main control loop that polls the StatusTracker every 20 seconds to check the state of running jobs and decides whether to submit new work.
  3. Thread Pool – An ExecutionContext with a fixed number of worker threads that execute the actual Spark jobs. Each worker pulls from the queue and processes partitions independently.

The Scheduling Decision Algorithm

The flowchart below illustrates the decision logic that runs every 20 seconds:

The algorithm follows these steps:

  1. Check capacity – If the number of active jobs is less than our target concurrency (tasks), submit a new job immediately.
  2. Query job stages – If we’re at capacity, use StatusTracker to get the current stage of each running job.
  3. Identify final stages – For each running job, find its maximum stage ID (which corresponds to its WRITE stage).
  4. Check for overlap opportunity – If all running jobs are currently executing their final (WRITE) stage, it’s safe to submit a new job because WRITE is I/O bound and won’t compete for CPU resources.

This approach ensures we maintain high throughput without overloading the cluster during CPU-intensive stages.

Core Implementation

import java.util.concurrent.{ArrayBlockingQueue, ConcurrentLinkedQueue, Executors, TimeUnit}

import org.apache.spark.{JobExecutionStatus, SparkJobInfo}

import scala.collection.JavaConverters._
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

  // Worker threads that execute the submitted jobs; jobSubmissionThreads controls concurrency
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(jobSubmissionThreads))

  // Futures of all submitted jobs, so we can wait for them to complete at the end
  val futures = new ConcurrentLinkedQueue[Future[Option[Seq[Long]]]]()

  val statusTracker = sparkSession.sparkContext.statusTracker

  // All partition groups (days) waiting to be processed
  val blockingQueue = new ArrayBlockingQueue[PartitionGroup](
    partitionGroups.size,
    false,
    partitionGroups.asJava
  )

  while (!blockingQueue.isEmpty) {
    val activeJobIds: Array[Int] = statusTracker.getActiveJobIds()

    val newJob: Option[Future[Option[Seq[Long]]]] =
      if (activeJobIds.length < tasks) {
        // Simple case: we have capacity
        Some(Future { processPartitions(blockingQueue.poll().partitions) })
      } else {
        // Check if running jobs are in their final (WRITE) stage
        val runningJobStatuses = Seq(JobExecutionStatus.RUNNING, JobExecutionStatus.UNKNOWN)

        val runningJobInfos: Array[Option[SparkJobInfo]] =
          activeJobIds
            .map(statusTracker.getJobInfo)
            .filter {
              case None    => true
              case Some(x) => runningJobStatuses.contains(x.status())
            }

        if (runningJobInfos.length <= tasks + 1) {
          // Get the maximum (final) stage ID for each running job
          val maxStageIdsForRunningJobs: Array[Int] =
            runningJobInfos.flatten
              .filter(_.status() == JobExecutionStatus.RUNNING)
              .map(_.stageIds().max)

          val runningStageIds: Array[Int] = statusTracker.getActiveStageIds()

          // If all jobs are on their final stage, submit another
          if (maxStageIdsForRunningJobs.diff(runningStageIds).isEmpty) {
            Some(Future { processPartitions(blockingQueue.poll().partitions) })
          } else {
            None
          }
        } else {
          None
        }
      }

    newJob.foreach(j => futures.add(j))
    Thread.sleep(20000) // Poll every 20 seconds
  }

  // Wait for all jobs to complete
  Await.result(Future.sequence(futures.asScala), Duration(2, TimeUnit.HOURS))

The Key Insight Explained

The magic happens in this condition:

maxStageIdsForRunningJobs.diff(runningStageIds).isEmpty

Let’s break it down with an example:

Example: 3 jobs running, each with three stages (READ, PROCESS, WRITE), where the WRITE stage always has the highest stage ID within its job

  • Job A: stages [10, 11, 12] → max = 12
  • Job B: stages [20, 21, 22] → max = 22
  • Job C: stages [30, 31, 32] → max = 32

So maxStageIdsForRunningJobs = [12, 22, 32]

Case 1: Jobs still in PROCESS stage

runningStageIds = [11, 21, 31]  (middle stages)
  diff = [12, 22, 32] - [11, 21, 31] = [12, 22, 32]  (not empty)
  → DON'T submit new job (would overload cluster)

Case 2: All jobs in WRITE stage

runningStageIds = [12, 22, 32]  (final stages)
  diff = [12, 22, 32] - [12, 22, 32] = []  (empty!)
  → SUBMIT new job (WRITE is I/O bound, has spare CPU capacity)

Trade-offs and Considerations

Advantages

  • Better utilization – I/O-bound WRITE stages overlap with CPU-bound READ/PROCESS stages
  • Reduced wall-clock time – WRITE time is amortized across the starts of subsequent jobs
  • No job changes needed – scheduling is external to the job logic
  • Graceful fallback – set jobSubmissionThreads <= 1 for sequential mode

Disadvantages

  • Code complexity – requires understanding Spark internals
  • Polling overhead – 20-second intervals add some latency
  • Memory pressure – multiple concurrent jobs increase memory usage
  • Debugging difficulty – interleaved jobs can be harder to trace

When to Use This Pattern

Good fit:

  • Many independent partitions to process
  • Jobs have distinct stage boundaries (READ → PROCESS → WRITE)
  • Final stages are I/O bound with lower parallelism
  • Cluster can handle 2-3 concurrent jobs

Not needed:

  • Simple, short-running jobs
  • Jobs with uniform resource usage across stages
  • Memory-constrained environments

Conclusion

By understanding Spark’s execution model — particularly how jobs decompose into stages with different resource profiles — we can implement intelligent scheduling that keeps our cluster busy. The key insight is that not all stages are equal: I/O-bound final stages can share resources with CPU-bound early stages of new jobs.

This optimization has allowed TracksETL to scale from 2 billion to 15+ billion events while maintaining acceptable processing times. The additional code complexity is worth it for workloads at this scale.

The StatusTracker API is underutilized in the Spark ecosystem. If you’re running large batch jobs with predictable stage patterns, consider whether stage-aware scheduling could improve your throughput.

Async is better than sync

Joining any new company is always a learning opportunity, and that was especially true when I started at Automattic, as I had never worked fully remotely before. The most important thing I learned during my first month at Automattic is that async communication is often much better than sync communication.

Before joining Automattic, email was the only async communication I did as part of my job, and I avoided it as much as I could. This was probably because most email interfaces/clients are not designed for a dialogue with multiple people. I always preferred meetings over discussing things through email. Even though meetings worked, it was difficult to get everyone in the same room on short notice, and there was also the overhead of keeping minutes and emailing them to everyone involved.

When I started working at Automattic, sync discussions with my team were possible but not very convenient, as almost everyone was in a different timezone. I had to schedule hangouts well in advance for any high-bandwidth discussion, since Slack is not appropriate for long-form discussion. To overcome these scheduling issues, I embraced P2s and started communicating through them as much as possible. As I used them more and more, I discovered that putting my ideas in writing made things more coherent and clearer for me, leading to much more efficient discussions. Also, since I didn’t have to wait for others’ availability, I could just post on a P2 and keep working on other things while my team responded. This also meant that we were not interrupting each other and could discuss things according to our own availability and priorities.

Thus, communicating asynchronously through P2s not only avoided the meeting-scheduling headaches of sync communication, but also resulted in much more clearly stated, well-thought-out problem statements and resolutions.

Seek patterns of Elasticsearch

One must know an application’s disk seek patterns when optimizing its storage layer. So when I was working on the performance analysis of Elasticsearch (ES), one of the first things to do was to determine its disk seek patterns. I used blktrace and seekwatcher for that purpose.

blktrace is a utility that traces block I/O requests issued to a given block device. To quote from the blktrace manual:

…blktrace receives data from the kernel in buffers passed up through the debug file system (relay). Each device being traced has a file created in the mounted directory for the debugfs, which defaults to /sys/kernel/debug

For using blktrace, the first step is to mount debugfs, a memory-based file system for debugging Linux kernel code.

# mount -t debugfs debugfs /sys/kernel/debug

The next step is to set up the disk on which we need to do the tracing. This should be a separate disk, as we do not want other OS activity to influence our readings. For this, I used /dev/sdc as the data directory of ES. Now, start ES and wait a few seconds after the cluster comes into green state, to be sure that we are only tracing disk activity during searches and not during startup. Before firing the ES query, start blktrace on /dev/sdc with the following command:

blktrace -d /dev/sdc -o es-on-ssd

This will start tracing block requests on the disk and write the output to es-on-ssd.blktrace.0 .. es-on-ssd.blktrace.n-1, where each file holds the requests traced on one core. Since I was using a quad-core CPU, I got the following files:

es-on-ssd.blktrace.0
es-on-ssd.blktrace.1
es-on-ssd.blktrace.2
es-on-ssd.blktrace.3

Now that we have data from blktrace, the next step is to visualize it. That can be done with the blkparse utility, which formats the blktrace output into human-readable form. But, as they say, a picture is worth a thousand words, and there is another tool, seekwatcher, that can produce plots and movies from the output of blktrace. I used it to visualize the seek patterns. To encode movies with seekwatcher, we also need to install MEncoder. Once seekwatcher and MEncoder are installed, run the following command to generate the movie:

# seekwatcher --movie-frames=10 -m -t es-on-ssd.blktrace.0 -o es-on-ssd.movie

It will produce es-on-ssd.movie that can be played with MPlayer. Following is the output that I got:

As can be seen, apart from a few random reads, most of the reads are sequential, and that is what I went on to optimize the storage for.


Bonus video: ES startup seeks

ES Start Seek Patterns from Anand Nalya on Vimeo.

Clustering of synthetic control data in R

This is an R implementation of the clustering example provided with Mahout. The original problem description is:

A time series of control charts needs to be clustered into their close knit groups. The data set we use is synthetic and so resembles real world information in an anonymized format. It contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). With these trends occurring on the input data set, the Mahout clustering algorithm will cluster the data into their corresponding class buckets. At the end of this example, you’ll get to learn how to perform clustering using Mahout.

We will be doing the same but using R instead of Mahout. The input dataset is available here.

For running this example, in addition to R, you also need to install the flexclust package available from CRAN. It provides a number of methods for clustering and cluster-visualization.

Here is the script:

x <- read.table("synthetic_control.data")
cat( "read", length(x[,1]), "records.\n")

# load clustering library
library(flexclust)

# get number of clusters from user
n <- as.integer( readline("Enter number of clusters: ")) 

# run kmeans clustering on the dataset
cl1 <- cclust(x, n)

print("clustering complete")

# show summary of clustering
summary(cl1)

# plot the clusters
plot(cl1, main="Clusters")

readline("Press enter for cluster histogram")
m<-info(cl1, "size") # size of each cluster
hist(rep(1:n, m), breaks=0:n, xlab="Cluster No.", main="Cluster Plot")

readline("Press enter for a plot of distance of data points from its cluster centorid")
stripes(cl1)
print("done")

Here are the graphs produced when we run the above script with the number of clusters n = 7:

Clusters

Frequency Histogram

Distance from centroid

Building a Single Page Webapp with jQuery

In a typical single-page web app, you load an HTML page and keep updating it by requesting data from the server, either periodically or whenever the user asks for it. Such an application is well suited for data-driven systems. For example, if you already have a REST API, you can easily create a single-page web client for it. The best way to understand a single-page web app is to look at the web interface of Gmail.

So let us define what we want to achieve in the end with our application:

  • We should be able to load different views on the same page.
  • We should be able to translate user intentions (clicks) into appropriate javascript function calls.
  • The url shown in the address bar of the browser should represent the state of the application. That is, when the user refreshes the browser, it should have the same content as prior to refresh.

The Root Page

So let us begin by defining the basic structure of our application. index.html is our application page that will be modified during the lifetime of our application. For the sake of simplicity, let us assume that we will be making changes to the #app div only.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
        <script type="text/javascript" src="./scripts/sugar.js"></script>
        <script type="text/javascript" src="./scripts/jquery-1.7.js"></script>
        <script type="text/javascript" src="./scripts/app.js"></script>
</head>
<body>
    <div id="doc">
        <div id="hd">
                Header/Navigation goes here
        </div>
        <div id="mainContent" class="content">
                <div id="app" role="application">
                        This is the content area 
                </div>
        </div>
        <div id="ft">
                Footer goes here 
        </div>
    </div>
</body>
</html>

Intercepting link clicks

Now we need to intercept all click actions on links and change the URL hash accordingly. For handling the URL hashes, we are using the jQuery BBQ plugin. Intercepting the link clicks lets us prevent the default navigation to other pages. Also, as this is the required behaviour for all links, we will set up the interceptor at the global level. If required, you can change the scope of this interception by modifying the jQuery selector where we define it.

$(document).ready(function() {

	// convert all a/href to a#href
	$("body").delegate("a", "click", function(){
		var href = $(this).attr("href"); // modify the selector here to change the scope of interception
		
		 // Push this URL "state" onto the history hash.
		$.bbq.pushState(href,2);

		// Prevent the default click behavior.
		return false;
	});
});

Defining the Routes

Next, we need to define the routes that are mapped to URL hashes. Also, since each hash change represents an action on the part of the user, we will also define which function to execute for each hash change. We do this by listening for the hashchange event and firing the appropriate JavaScript function.

$(document).ready(function() {

	// Bind a callback that executes when document.location.hash changes.
	$(window).bind( "hashchange", function(e) {
		var url = Object.extended($.bbq.getState()).keys();
		
		if(url.length==1){
			url = url[0];
		}else{
			return;
		}
	    
		// url action mapping
		if(url.has(/^\/users$/)){
			showUserList();
		} else if (url.has(/^\/users\/\d+$/)){ // matching /users/1234
			showUser(url)
		}
		// add more routes
	});
});

Initialization

Even though we now have proper routes and actions defined, our index page is still empty, so we need to initialize the app. This is done by setting the default hash, which fires the hashchange event when the page loads, or, in case the user has refreshed the page, by triggering the hashchange event manually.

$(document).ready(function() {	
	if(window.location.hash==''){
		window.location.hash="#/users"; // home page, show the default view (user list)
	}else{
		$(window).trigger( "hashchange" ); // user refreshed the browser, fire the appropriate function
	}
});

Sample Action

Now that we have the skeleton for our application ready, let’s have a look at a sample action. Suppose the default action is showing a list of users [#/users]. We have mapped this hash to the function showUserList. There are several ways of combining the HTML structure and the JSON data for the user list, but we will use the simplest one: first we get the HTML fragment representing the user list using jQuery’s .load() function, and then we populate it with the actual data we get from the REST API.

var showUserList = function(){
	$('#app').load('/html/users.html #userListDiv', function() {
		$.getJSON('/users', function(data) {
			
			// create the user list
			var items = [ '<ul>'];
			$.each(data, function(index, item) {
				items.push('<li><a href="/users/' + item.id + '">'
						+ item.name + '</a></li>');
			});
			items.push('</ul>');
			
			var result = items.join('');
			
			// clear current user list
			$('#userListDiv').html('');
			
			// add new user list
			$(result).appendTo('#userListDiv');
		});
	});
};

If you find all the string concatenation going on in the above snippet a bit sloppy, you can always use your favourite template plugin for jQuery.

Note: The Object.extended function in the above snippets comes from the wonderful Sugar.js library, which makes working with native JavaScript objects a breeze.

Idea for Windows Phone Team – Make switching easier and less costly

Microsoft is already paying developers for porting their applications from iOS and Android. This is all well for the developers, but what about the end users? They have already invested a lot in the apps on those platforms, so the cost of switching is not just the hardware/carrier-contract cost but also buying all those apps again for WP7. Also, what happens to all the data stored in the apps on the other platforms?

Microsoft, here is an idea for you: pay developers extra to enable importing data from other platforms, and make those apps free for consumers who have already purchased them on another platform.

Role based views using css and javascript

In web applications, role-based views are normally implemented using server-side if..else tags and the like. When developing pure JavaScript clients, one way of doing role-based views is again using JavaScript conditionals. This approach suffers from the fact that role-related code is scattered throughout the document. Another simple way to achieve the same is using CSS classes.

Let us assume that we have three roles: ADMIN, USER, and GUEST. We then define three CSS classes as follows:

.role_admin, .role_user, .role_guest{
    display:none;
}

Now consider the following HTML document, which has CSS classes attached to elements according to the current user’s role.

<html>
<head>
<link href="./role.css" rel="stylesheet" type="text/css" />
</head>
<body>
<h1 class="role_admin">This should be visible to admin only.<h1>
<h2 class="role_user">This is for the normal logged in user.</h1>
<h3 class="role_guest">And this is for the guest.</h1>
</body>
</html>

At this stage, if you load the document in a browser, you will not see anything, as all the roles are invisible. So let’s spruce this up with a bit of JavaScript to set the proper CSS class properties.

$(document).ready(function() {
    // We are assuming that jQuery and jQuery Cookie plugin are included in the doc

    var role = $.cookie('user_role'); // Should return one of 'ADMIN', 'USER' OR 'GUEST'

    // set property for relevant css class
    var cssClass = '.role_'+ role.toLowerCase();
    $("<style type='text/css'> "+ cssClass + "{ display:block } </style>").appendTo("head");
});

If a cookie is set that represents the current user’s role, then adding the above snippet will make all the elements tagged with the CSS class .role_{roleCookie} visible.

Even though this is a very simple implementation, we can easily modify it for more complex scenarios. For example, if role-based visibility is completely ordered (an access level), i.e. given roles R1 and R2 we can always tell which one has more visibility, then we can extend it as follows:

$(document).ready(function() {
    // We are assuming that jQuery and jQuery Cookie plugin are included in the doc

    var role = $.cookie('user_role'); // Should return one of 'ADMIN', 'USER' OR 'GUEST'

    // set property for relevant css class
    if(role == "ADMIN") {
        // make everything visible
        $("<style type='text/css'>.role_admin, .role_user, .role_guest {display:block} </style>").appendTo("head");

    } else if(role == "USER") {
        // make user and guest part visible, admin part will remain invisible
        $("<style type='text/css'>.role_user, .role_guest {display:block} </style>").appendTo("head");

    } else if(role == "GUEST") {
        // only show the guest part
        $("<style type='text/css'>.role_guest {display:block} </style>").appendTo("head");
    }
});

Caution: One thing to note here is that all parts of the document are sent to all users; only what is visible in the browser is role based. So server-side authorisation checks are still necessary, as nothing stops a curious user from looking into the source and firing that dreaded request.

Hadoop Tuning Notes

This is a quick dump of my notes on Hadoop tuning.

General

  • Hadoop is designed to use multiple cores and disks, so it will be able to take full advantage of more powerful hardware.
  • Don’t use RAID for HDFS datanodes as redundancy is handled by HDFS. RAID should be used for namenodes as it provides protection against metadata corruption.
  • The machine running the namenode should be a 64-bit system to avoid the 3GB limit on JVM heap size.
  • In case the cluster consists of more than one rack, it is recommended to tell Hadoop about the network topology. Rack awareness will help Hadoop in calculating data locality while assigning MapReduce tasks and it will also help HDFS to choose replica location for files more intelligently.
  • Two processors will be engaged by the datanode and tasktracker daemons; the remaining n-2 processors can be oversubscribed with a factor of 1 to 2 tasks each.
  • On the master node, each of the namenode, jobtracker and secondary namenode takes 1000MB of memory by default. If you have a large number of files, increase the JVM heap size for the namenode and secondary namenode.

Configuration Parameters

HDFS

dfs.block.size
The block size used by HDFS, which defaults to 64MB. On large clusters this can be increased to 128MB or 256MB to reduce memory requirements on the namenode and to increase the amount of data given to each map task.
dfs.name.dir
should give a list of directories where the namenode persists copies of its metadata. It should include one or two local disks and a remote disk, such as an NFS-mounted directory, so that in case of node failure the metadata can be recovered from the remote copy.
dfs.data.dir
specifies the list of directories used by datanodes to store data. It should always be local disks, and if there are multiple disks then each directory should be on a different disk so as to maximize parallel reads and writes.
fs.checkpoint.dir
list of directories where the secondary namenode keeps checkpoints. It should use redundant disks for the same reason as dfs.name.dir.
dfs.replication
number of copies of data to be maintained. It should be at least 2 more than the number of machines expected to fail in the cluster on any given day.
dfs.access.time.precision
The precision in milliseconds at which access times are maintained. If this value is 0, no access times are maintained, resulting in a performance boost. Also, storage disks should be mounted with noatime, which disables last-access-time updates during file reads and gives considerable performance gains.
dfs.datanode.handler.count
Number of datanode threads handling block requests. With multiple physical disks, throughput can be improved by increasing this number from the default value of 3.
dfs.namenode.handler.count
Number of handler threads on the namenode. This number should be increased from the default value of 10 for large clusters.

MapReduce

mapred.local.dir
is the list of directories where intermediate data and working files are stored by the tasks. These should be spread across a number of local disks to facilitate parallel I/O. Also, these partitions should be the same ones used by the datanodes to store data (dfs.data.dir).

mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum
specifies the maximum number of map and reduce tasks that can be run at the same time. It should be a multiple of the number of cores.
mapred.job.tracker.handler.count
Number of server threads for handling tasktracker requests. The default value is 10 and the recommended value is about 4% of the number of tasktracker nodes.
mapred.child.java.opts
increase the JVM heap memory of tasks that require more memory.

Others

core-site.xml::io.file.buffer.size
This is the buffer size used by Hadoop during I/O, which defaults to 4KB. On modern systems it can be increased to 64KB or 128KB for performance gains.