Stored Streams

Overview

Scanner represents input and output data as sequences of data items called stored streams (StoredStream). Stored streams are Python objects that tell Scanner how to read the data to be processed and how to write data after a Scanner application has processed it. Stored streams can represent data stored in a variety of locations or formats:

Video files (mp4, mkv, etc.)
Collections of files (images, text, etc.)
Packed binary files (custom RAW formats)
SQL tables (image metadata)

For example, the following code creates a stored stream for a video file named example.mp4:

import scannerpy as sp
sc = sp.Client()
video_stream = sp.NamedVideoStream(sc, 'example', path='example.mp4')

NamedVideoStream is a special type of stored stream that represents data stored inside Scanner’s internal datastore. Since Scanner was built specifically for processing video, it has specialized support for fast access to frames in videos, even under random access patterns. In order to efficiently read frames from a video, Scanner needs to build an index over the compressed video. By specifying path='example.mp4', we’ve told Scanner to initialize a stream named example from the example.mp4 video.
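
To illustrate why an index helps with random access, here is a minimal sketch in plain Python. It does not use Scanner's API or its actual index format, which this guide does not describe; the keyframe positions and the helper frames_to_decode are hypothetical. Compressed video can generally only be decoded starting from a keyframe, so a sorted list of keyframe positions lets a reader find the nearest starting point with a binary search instead of decoding from the beginning of the file.

```python
from bisect import bisect_right

# Hypothetical keyframe positions (frame numbers) for a toy video.
KEYFRAMES = [0, 30, 60, 90]


def frames_to_decode(keyframes, target):
    """Return the range of frames that must be decoded to reach `target`.

    Decoding starts at the last keyframe at or before `target`.
    `keyframes` must be sorted in ascending order.
    """
    # bisect_right returns the insertion point after any equal entry, so
    # subtracting one gives the index of the last keyframe <= target.
    idx = bisect_right(keyframes, target) - 1
    if idx < 0:
        raise ValueError("no keyframe at or before the requested frame")
    return range(keyframes[idx], target + 1)


needed = frames_to_decode(KEYFRAMES, 75)
print(f"decode {len(needed)} frames starting at frame {needed[0]}")
```

In this toy example, reaching frame 75 means decoding from frame 60 rather than from frame 0, which is the kind of saving an index over the compressed video is meant to provide.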

Another example of a stored stream is the FilesStream, which represents a stream of individual files stored on a filesystem or cloud blob storage:

from scannertools.storage.files import FilesStream
image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']
file_stream = FilesStream(sc, paths=image_paths)

By default, FilesStream reads from the local filesystem. However, like all stored streams, the storage location and configuration options can be specified with a StorageBackend.

The rest of this guide explains the additional features of stored streams.

Storage Backends

A StorageBackend represents the specific storage location or format for stored streams. For example, FilesStream can be configured to read files from Amazon’s S3 storage service instead of the default local filesystem by creating the appropriate storage backend:

from scannertools.storage.files import FilesStorage, FilesStream
image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']
file_storage = FilesStorage(storage_type='s3',
                           bucket='example-bucket',
                           region='us-west-1')
file_stream = FilesStream(sc, paths=image_paths, storage=file_storage)
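
The separation between a stream and its storage backend can be pictured with a small toy model. The sketch below is plain Python and does not use Scanner's classes: LocalBackend, InMemoryBackend, and ToyFilesStream are hypothetical names invented for illustration. It shows the design idea only: the stream holds the list of paths, while the backend decides how each path is turned into bytes, so swapping backends does not change the stream code.

```python
import os
import tempfile


class LocalBackend:
    """Hypothetical backend that reads paths from the local filesystem."""

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()


class InMemoryBackend:
    """Hypothetical backend standing in for a remote store such as S3."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def read(self, path):
        return self._blobs[path]


class ToyFilesStream:
    """Hypothetical stream: a list of paths plus the backend that resolves them."""

    def __init__(self, paths, storage=None):
        self.paths = list(paths)
        # Mirror the guide's default: fall back to the local filesystem.
        self.storage = storage if storage is not None else LocalBackend()

    def load(self):
        # Yield the contents of each path, one element at a time.
        for path in self.paths:
            yield self.storage.read(path)


# Same stream code, two different storage locations.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "image1.jpg")
    with open(local_path, "wb") as f:
        f.write(b"local bytes")
    local_items = list(ToyFilesStream([local_path]).load())

remote = InMemoryBackend({"image1.jpg": b"remote bytes"})
remote_items = list(ToyFilesStream(["image1.jpg"], storage=remote).load())
print(local_items, remote_items)
```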

I/O Operations

I/O operations allow Scanner applications to read from and write to stored streams. To read from a stored stream, a Scanner application constructs an Input() operation, specifying a list of stored streams:

frame = sc.io.Input([video_stream])

This code creates a sequence of video frames, frame, that can be used in the context of a Scanner computation graph to read the video specified by video_stream (to learn more about computation graphs, check out the Computation Graphs guide). Stored streams are also used to specify where to write data:

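# Use the sp prefix, matching the import at the top of this guide.
output_video_stream = sp.NamedVideoStream(sc, 'example-output')
# Bind the output operation to its own name instead of overwriting frame.
output_op = sc.io.Output(frame, [output_video_stream])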

Here, the frames read in earlier will be written back out to a NamedVideoStream named example-output.

Reading Data Locally

Stored streams can be read directly in Python by calling the load() method:

for frame in video_stream.load():
    print(frame.shape)

Reading from this stream lazily loads video frames from video_stream as NumPy arrays. If the stream held bounding boxes or some other data format instead, the load() method would return data elements formatted according to the data type of the stream.
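
Lazy loading here means that elements are produced one at a time as the loop asks for them, rather than all being read up front. The following plain-Python sketch illustrates that behaviour with a generator; lazy_load and decode are hypothetical stand-ins and are not part of Scanner.

```python
from itertools import islice

decoded = []  # records which elements were actually decoded


def decode(index):
    """Hypothetical stand-in for an expensive per-element decode step."""
    decoded.append(index)
    return f"frame-{index}"


def lazy_load(num_elements):
    """Yield elements one by one; nothing is decoded until requested."""
    for index in range(num_elements):
        yield decode(index)


stream = lazy_load(1000)
decoded_before_iteration = list(decoded)  # still empty: no work done yet

# Take only the first three elements; the other 997 are never decoded.
first_three = list(islice(stream, 3))
print(first_three, decoded)
```

With a loader that behaves like this generator, breaking out of a loop early avoids paying for the elements that were never requested.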

Deleting Stored Streams

Stored stream data is persistent: unless a stored stream is explicitly deleted, the data will stay around and can be used in future Scanner applications. A stored stream can be deleted by invoking the delete() method:

video_stream.delete(sc)

If there are multiple streams to delete, it can be more efficient to invoke a bulk delete operation by calling delete() on the storage backend itself:

video_stream.storage().delete(sc, [...])
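
One plausible reason a bulk delete can be cheaper is that it pays any per-call overhead once rather than once per stream. This guide does not describe how Scanner implements deletion, so the sketch below is only a toy model: ToyStorage and its round_trips counter are hypothetical and simply count how many calls reach the storage layer in each approach.

```python
class ToyStorage:
    """Hypothetical backend that counts simulated round trips to storage."""

    def __init__(self, names):
        self.names = set(names)
        self.round_trips = 0

    def delete(self, names):
        # One simulated round trip per call, however many names are passed.
        self.round_trips += 1
        self.names -= set(names)


to_delete = ["a", "b", "c"]

# Deleting streams one at a time: one call per stream.
one_by_one = ToyStorage(["a", "b", "c", "d"])
for name in to_delete:
    one_by_one.delete([name])

# Bulk delete: a single call covering every stream.
bulk = ToyStorage(["a", "b", "c", "d"])
bulk.delete(to_delete)

print(one_by_one.round_trips, bulk.round_trips)
```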

In-place Video Indexing

By default, Scanner copies the video data for a NamedVideoStream to Scanner’s internal database (located at ~/.scanner/db by default) when it builds the index over the video for fast frame access. For a limited set of video container formats (currently only MP4), Scanner also supports accessing videos without copying them using the inplace=True flag:

# Use the sp prefix, matching the import at the top of this guide.
input_stream = sp.NamedVideoStream(sc, 'sample-clip', path='sample-clip.mp4',
                                   inplace=True)

This still builds the index for accessing the video but avoids copying the files. When Scanner accesses the video data, it will read it directly from the path provided to the named video stream.