Windows in StreamInsight: Hopping vs. Snapshot
- by Roman Schindlauer
Three weeks ago, we explained the basic concept of windows in StreamInsight: defining sets of events that serve as arguments for set-based operations, like aggregations. Today, we want to discuss the so-called Hopping Windows and compare them with Snapshot Windows. We will compare these two, because they can serve similar purposes with different behaviors; we will discuss the remaining window type, Count Windows, another time. Hopping (and its syntactic-sugar-sister Tumbling) windows are probably the most straightforward windowing concept in StreamInsight. A hopping window is defined by its length, and the offset from one window to the next. They are aligned with some absolute point on the timeline (which can also be given as a parameter to the window) and create sets of events. The diagram below shows an example of a hopping window with length of 1h and hop size (the offset) of 15 minutes, hence creating overlapping windows: Two aspects in this diagram are important: Since this window is overlapping, an event can fall into more than one windows. If an (interval) event spans a window boundary, its lifetime will be clipped to the window, before it is passed to the set-based operation. That’s the default and currently only available window input policy. (This should only concern you if you are using a time-sensitive user-defined aggregate or operator.) The set-based operation will be applied to each of these sets, yielding a result. This result is: A single scalar value in case of built-in or user-defined aggregates. A subset of the input payloads, in case of the TopK operator. Arbitrary events, when using a user-defined operator. The timestamps of the result are almost always the ones of the windows. Only the user-defined operator can create new events with timestamps. (However, even these event lifetimes are subject to the window’s output policy, which is currently always to clip to the window end.) Let’s assume we were calculating the sum over some payload field: var result = from window in source.HoppingWindow(
TimeSpan.FromHours(1),
TimeSpan.FromMinutes(15),
HoppingWindowOutputPolicy.ClipToWindowEnd)
select new { avg = window.Avg(e => e.Value) };
Now each window is reflected by one result event:
As you can see, the window definition defines the output frequency. No matter how many or few events we got from the input, this hopping window will produce one result every 15 minutes – except for those windows that do not contain any events at all, because StreamInsight window operations are empty-preserving (more about that another time).
The “forced” output for every window can become a performance issue if you have a real-time query with many events in a wide group & apply – let me explain: imagine you have a lot of events that you group by and then aggregate within each group – classical streaming pattern. The hopping window produces a result in each group at exactly the same point in time for all groups, since the window boundaries are aligned with the timeline, not with the event timestamps. This means that the query output will become very bursty, delivering the results of all the groups at the same point in time. This becomes especially obvious if the events are long-lasting, spanning multiple windows each, so that the produced result events do not change their value very often. In such a case, a snapshot window can remedy.
Snapshot windows are more difficult to explain than hopping windows: they represent those periods in time, when no event changes occur. In other words, if you mark all event start and and times on your timeline, then you are looking at all snapshot window boundaries:
If your events are never overlapping, the snapshot window will not make much sense. It is commonly used together with timestamp modification, which make it a very powerful tool. Or as Allan Mitchell expressed in in a recent tweet: “I used to look at SnapshotWindow() with disdain. Now she is my mistress, the one I turn to in times of trouble and need”.
Let’s look at a simple example: I want to compute the average of some value in my events over the last minute. I don’t want this output be produced at fixed intervals, but at soon as it changes (that’s the true event-driven spirit!). The snapshot window will include all currently active event at each point in time, hence we need to extend our original events’ lifetimes into the future:
Applying the Snapshot window on these events, it will appear to be “looking back into the past”:
If you look at the result produced in this diagram, you can easily prove that, at each point in time, the current event value represents the average of all original input event within the last minute. Here is the LINQ representation of that query, applying the lifetime extension before the snapshot window:
var result = from window in source
.AlterEventDuration(e => TimeSpan.FromMinutes(1))
.SnapshotWindow(SnapshotWindowOutputPolicy.Clip)
select new { avg = window.Avg(e => e.Value) };
With more complex modifications of the event lifetimes you can achieve many more query patterns. For instance “running totals” by keeping the event start times, but snapping their end times to some fixed time, like the end of the day. Each snapshot then “sees” all events that have happened in the respective time period so far.
Regards,
The StreamInsight Team