Working with Pravega: StreamCuts¶
This section describes StreamCut
s and its usage with streaming clients and batch clients.
Pre-requisites¶
Familiarity with Pravega Concepts.
Definition¶
A Pravega Stream is formed by one or multiple parallel Stream Segments for storing/reading events. A Pravega Stream
is elastic, as it handles the changes in the number of parallel Stream Segments along time to accommodate
fluctuating workloads. A StreamCut
represents a consistent position in the stream. It contains
a set of Stream Segments and offset pairs for a single stream which represents the complete keyspace at a given
point in time. The offset always points to the event boundary and hence there will be no offset pointing to
an incomplete event.
The StreamCut
representing the tail of the stream (with the newest event) is an ever changing one since
events can be continuously added to the stream and the StreamCut
pointing to the tail of the stream with
newer events would have a different value. Similarly the StreamCut
representing the head of the
stream (with the oldest event) is an ever changing one as the stream retention policy could truncate the stream
and the StreamCut
pointing to the head of the stream post truncation would have a different value.
StreamCut.UNBOUNDED
is used to represent such a position in the stream and the user can use it to
specify this ever changing stream position (both head and tail of the stream).
It should be noted that StreamCut
s obtained using the streaming client and batch client can be used
interchangeably.
StreamCut with Reader¶
A Reader Group is a named collection of Readers that together, in parallel, read Events from a given stream. Every
Reader is always associated with a Reader Group. StreamCut
(s) can be obtained from a Reader Group using the
following APIs:
- getStreamCuts()
: The API io.pravega.client.stream.ReaderGroup.getStreamCuts
returns a
Map<Stream, StreamCut>
which represents the last known Position of the Readers for all the streams managed by the Reader Group.
generateStreamCuts()
: The APIio.pravega.client.stream.ReaderGroup.generateStreamCuts
, generates aStreamCut
after co-ordinating with all the Readers usingio.pravega.client.state.StateSynchronizer
. AStreamCut
is generated by using the latest Stream Segment read offsets returned by the Readers along with unassigned segments (if any). The configurationReaderGroupConfig.getGroupRefreshTimeMillis()
decides the maximum delay by which the Readers return the latest read offsets of their assigned segments. TheStreamCut
generated by this API can be used by the application as a reference to a Position in the stream. This is guaranteed to be greater than or equal to the position of the Readers at the point of invocation of the API.
A StreamCut
can be used to configure a Reader Group to enable bounded processing of a Stream. The start
and/or end StreamCut
of a Stream can be passed as part of the Reader Group configuration. The below example
shows the different ways to use StreamCut
s as part of the Reader Group configuration.
/*
* The below ReaderGroup configuration ensures that the readers belonging to
* the ReaderGroup read events from
* - Stream "s1" from startStreamCut1 (representing the oldest event) upto
endStreamCut1 (representing the newest event)
* - Stream "s2" from startStreamCut2 upto the tail of the stream, this is similar to using StreamCut.UNBOUNDED
* for endStreamCut.
* - Stream "s3" from the current head of the stream upto endStreamCut2
* - Stream "s4" from the current head of the stream upto the tail of the stream.
*/
ReaderGroupConfig.builder()
.stream("scope/s1", startStreamCut1, endStreamCut1)
.stream("scope/s2", startStreamCut2)
.stream("scope/s3", StreamCut.UNBOUNDED, endStreamCut2)
.stream("scope/s4")
.build();
The below API can be used to reset an existing Reader Group with a new Reader Group configuration instead creating a Reader Group.
/*
* ReaderGroup API used to reset a ReaderGroup to a newer ReaderGroup configuration.
*/
io.pravega.client.stream.ReaderGroup.resetReaderGroup(ReaderGroupConfig config)
StreamCut with Stream Manager¶
StreamCut
representing the current head and current tail of a stream can be obtained using the StreamManager
API getStreamInfo(String scopeName, String streamName)
.
/**
* Get information about a given Stream, {@link StreamInfo}.
* This includes {@link StreamCut}s pointing to the current HEAD and TAIL of the Stream.
*
* @param scopeName The scope of the stream.
* @param streamName The stream name.
* @return stream information.
*/
StreamInfo getStreamInfo(String scopeName, String streamName);
StreamCut with BatchClient¶
BatchClient
can be used to perform bounded processing of the stream given the start and end StreamCut
s. BatchClient
API io.pravega.client.batch.BatchClient.getSegments(stream, startStreamCut, endStreamCut)
is used to
fetch segments which reside between the given startStreamCut
and endStreamCut
. With the retrieved segment information, the user can consume all the events in parallel without adhering to time ordering of events.
It must be noted that passing StreamCut.UNBOUNDED
to startStreamCut
and endStreamCut
will result in using the current head of stream and the current tail of the stream, respectively.
We have provided a simple yet illustrative example of using StreamCut here.