Buffering and Asynchronous user interaction

The problem

An implementation of the decoder architecture described above must take into account two constraints which quickly become contradictory if they have not been integrated in the architecture from the very start:

  • User interaction is asynchronous and takes place with what is currently displayed on screen and what is played by the speakers.

  • What is displayed on screen is not what is being read from the source medium because of the buffering explained in the previous section.

The key to understand the issues discussed below is to understand the fundamental difference between two very different timebases. The display timebase represents the events which occur on screen and through the speakers while the reading timebase represents the events which occur during data reading from the source device.

If the reading module is reading the data which corresponds to the movie time 00:42:05 (42 minutes and 05 seconds), the display module will receive and display this data only later (between 0.5 and 2 seconds later, depending on the amount of buffering). Generally speaking, any event detected by the reading module in the stream read, will also be triggered by the display module but only later.

The main consequence is that, for example, the time displayed by a DVD player must be based on information provided by the display module and not on information provided by the reading module. Otherwise, users would be able to notice a skew between TV display and DVD time. This might seem obvious for the display of the movie time but it is less obvious for other context information such as the chapter number and the title number currently played by a DVD player: the A/V stream of the movie is artificially split in multiple chapters which represent the logical layout of the movie storyboard or video sequences. The A/V stream itself is a contiguous stream, from the start of the movie to the end of the movie and the description of the chapter layout is stored in DVD tables which are stored in the magic .ifo files of a DVD.

These tables contain start and end pointers to locations in the A/V stream which identify the start and end of each chapter. Typically, DVD software reads these tables before starting playback of the movie and displays on the small LCD screen of the DVD player the number of the chapter currently played. This is where things start to get ugly.

An obvious very simple implementation of such a feature would consist of a small piece of code which would run in the reading module and which would trigger an event everytime the start or end of one of these chapter is crossed by a reading request. Of course, such an implementation would create a skew between what is displayed on the LCD screen and what is displayed on the TV screen: for a very brief time, at every change of chapter, the chapter number displayed by the LCD screen would not be the chapter number of the data displayed on TV but the chapter number of the next chapter.

I am sure most readers would say that it's kind of okay: the chapter number displayed on the LCD screen is the right one 99.99% of the time, who cares about these very brief moments ? Which user will ever notice ? And if they notice, will they care at all ? Unfortunatly, this small detail has some rather serious user-visible consequences.

The most serious consequence is that, for one second, during the chapter change, the user might trigger a next chapter or a previous chapter command. Both of these commands would be based on the chapter number displayed on the LCD screen which would result in skipping entirely the following chapter for the next chapter command and in going to the begenning of the current chapter for the previous chapter command. Clearly, none of these behaviors are really acceptable and their visibility is very high. None of the DVD players I tested (I did test quite a lot of them) did show such a Bbehavior.

The example described above is not one of its kind. All of the user-triggered asynchronous commands are sensitive to similar problems and all of these problems come from just one source: the confusion between the context with which the user interacts (that is, the display context and its timbase) and the reading context. This confusion is often triggered by the ease of implementation of the reading context which is then confused with the display context. It is also important to notice that maintaining an accurate display context can be very difficult technically.

Implementors of interactive A/V decoders should thus take a very conservative approach to this problem. Not solving it will lead to a lot of incredible complexity added to all pieces of software because, sooner or later, the testing lab or a customer will find a bug related to this issue and consumer-quality devices must work reliably at all times to ensure low return rates which, in turn, ensure low costs and high profits. This is especially true in the DVD player market because the price of DVD players is dominated by the licensing, marketing and customer-support costs. Building the player and buying its components costs at most 20 USD with licensing fees reaching up to 30 USD: an 80 USD player thus represents a very small profit.

DVD players are also especially sensitive to this problem because the DVD-Video specification is highly interactive and a lot of asynchronous events are continuously triggered. A 100% correct implementation requires careful planning ahead of the development itself.

A solution

A complete description of the synchronization of the synchronous and asynchronous events of a DVD player is very complex and involves numerous tasks:

  • control the correct conversion between frame-based pictures and field-based pictures,

  • control the synchronization between audio, video, subpicture and highlight streams,

  • control the display of user information on the LCD screen,

  • maintain an accurate player state to ensure a correct reaction to the asynchronous user events

I must say I have been unsuccessfully trying to find an elegant and reliable solution to all these problems taken together. The common factor of all the solutions I came up with, implemented and tested is that either they did not solved a corner case (ie: they were not reliable) or they were horribly convoluted, complex and their inherent complexity made the final product unreliable. Some of the reasons for these failures came from existing legacy interfaces which had to be used and were inherently unreliable (by the design of the interface) and buggy.

The following sections try to explain what I think might be a useful generic architecture which could be used to implement reliably all the rather complex features described before. One of the key aspects of the architecture I describe below is that it does not mandate the system to be implemented that way (even though I believe it should :). Instead, it mandates an interface to the system: implementors might try and succeed in implementing this interface in a very different way from the abstract model presented here.

The streaming graph

Conceptually, the player is presented by a set of nodes which can process data independently from each other: that is, each node contains at least one thread. These nodes are connected by a set of FIFOs. The behavior of each node can be controlled by the global controller through a set of properties specific to each node. The controller overviews the state of each node and ensures that the graph as a whole is in a coherent state. The value of each property can be modified at any time, very quickly (ie: changing the value of a property is a non-blocking operation) and it should be taken into account for the next fifo entry processed by the node.

The controller exports a high-level API to the outside world. For example, the controller would export a Pause and a Fast Forward function. A typical DVD graph would contain a reading node, a demux node, a video, an audio and a subpicture decoder node, and a video and audio output node: the figure below shows such a graph.

The main loop of a node is relatively simple: its goal in life is to get entries from its input fifo and process them. Each entry in a fifo is either a data entry (that is, it references data to process if the node is in the running state) or a tag entry. Whenever a tag reaches a node, the node notifies its controller that the tag has just arrived. Depending on whether the tag entry is synchronous or asynchronous [1], the notification function call can block or not. If the target id of the entry does not match the id of the node [2] , the entry is further propagated on the output fifos of the node. The entry is discarded otherwise. Each tag entry is given an id, unique within all tags such that the controller can recognize a tag when it flows through the graph fifos and nodes.

void main_loop (void) 
{
        while (1) {
                entry = input_fifo.pop_entry ();
                switch (entry.get_type ()) {
                case TAG:
                        if (entry.is_sync ()) {
                                controller.notify_sync (node, entry);
                        } else {
                                controller.notify_async (node, entry);
                        }
                        if (!entry.match_target (node)) {
                                for (output_fifo = output_fifos.first (); 
                                     output_fifo.is_last (); 
                                     output_fifo.next ()) {
                                        output_fifo.push_entry (entry);
                                }
                        } else {
                                dump_entry (entry);
                        }
                        break;
                case DATA:
                        if (is_running ()) {
                                process_and_output_data (entry);
                        } else {
                                dump_entry (entry);
                        }
                        break;
                }
        }
}

As mentionned in the title, this design is inherently streaming-based: it is supposed to work most of the time in an equilibrium state either where the fifos of the graph are full or when they are completely empty of data.

When the graph is empty of any kind of data, even if the nodes are in the running state, they are blocked in their call to input_fifo.pop_entry () which cannot return since the fifos are empty.

When the graph is filled with data, there are two factors which drive the system to equilibrium, based on the critical behavior of the output nodes:

  • the total amount of memory available for audio/video buffers: if there is no memory left for the audio/video buffers, the reading node cannot trigger new read transactions with the DVD drive. If no new data is sent in the graph, it will slowly empty itself, (unless the output nodes are stopped) thus returning some memory to the audio/video buffer pools and thus allowing the reading node to restart. The reading node thus oscillates around the equilibrium.

  • the number of entries in each finite-size fifo: if an output fifo is full, the node cannot send data into this fifo and thus blocks. Unless the output nodes are stopped, the output fifos will be emptied by the node connected to the other side of the fifo which will allow the node to restart.

If the output nodes do not process data from their input fifos, they will bring the graph to a halt: the fifos will fill themselves and will block the mainloops of all nodes. If the output nodes process data, they will empty their input fifos, thus allowing other nodes to push data into these fifos.

Use scenarios

This design allows one to trivially implement the very important flushing primitive: when the controller decides that it is time to flush completely the graph or a part of the graph, of the data and the tags present in the node fifos, it switches the relevant nodes to the non-running state through the relevant properties and then pushes in the top-level fifo a special EOS tag (End Of Stream tag). The target id of this EOS tag matches none of the nodes in the graph so that the tag is propagated throughout the whole graph.

In this mode, the part of the graph which got switched to the non-running state is going to empty its fifos of all data present and will notify the graph controller of each tag present in the fifos, including the EOS tag. Whenever the controller receives an EOS tag notification from a node, it switches the node back to the running state through the relevant property. This ensures that all data queued in the graph before the EOS tag was queued is properly flushed.

This design can be used to provide the information needed to synchronize the audio, subpicture and video streams: whenever the demux node sees an Mpeg2 pack and its associated PTS (presentation time stamp), the demux creates a new tag entry which contains the PTS and pushes it in the relevant output fifo (that is, if the pack contained audio, the demux creates a tag entry which contains the PTS and pushes it in its output audio fifo). If all the other nodes propagate this tag, it will be easy for the controller to know whether audio or video is late and thus to decide on what to do to make them be both on time.

This design can also be used to maintain a bit-accurate chapter number for DVD for example: when the top-level reading node crosses a chapter boundary, it can insert a special chapter number tag which will be used by the controller when the tag has reached both the video output and the audio output to update the current chapter number.

There are many other uses of the tag primitive: it is possible to use it to implement a very beautiful vobu-accurate repeat-AB feature for example. [3]. When the reading node reaches point B on the source, it inserts a new B point tag into the stream and then keeps on reading data from the source. When the B point tag reaches the audio and video output nodes, these nodes notify synchronously the controller which switches both of them to the pause state to ensure they do not present more data. The controller then flushes the graph and resets the reading node to point A. The key here is that if the user canceled the repeat AB mode, the controller does nothing when it sees the B point tag and playback continues, thus ensuring smooth playback through the B point. Smooth playback is possible here because the graph was not emptied of its data: there is no starvation in the audio and video output nodes. If, instead, the reading node had merely stopped reading data after point B, and if the user canceled the repeat AB mode just before point B is displayed, the graph would be empty, thus resulting in audio and video starvation while playback should have continued from the point of view of the user.

DVD-Video experts will recognize in the repeat AB feature a close cousin of the VOBU Still feature which triggers a Still (something similar to a pause) at the end of each VOBU played unless the user did hit the Play button.

Important details

There are a few details which I did not catch in the design described above: output nodes have a very special role in the state of the graph as a whole. If they are paused, they pause immediatly the whole graph. This is extremly important to ensure truly immediate reaction to user events. As such, it is important to ensure that output nodes react very quickly to any change in the value of their properties.

Typically, for a video output, it would need to be able to react within a 25th of a second (for SECAM/PAL output) or within a 30th of a second (for NTSC output). VGA output might require different timings. Audio output control will require similar accuracy.



[2] The matching process can be a bit complex if needed: it is possible to define broadcast or group target ids such that they match all or a group of node ids. However, as far as I can tell, there should be no need for such a feature.

[3] DVD afficionados know that all DVD players support a useless feature named repeat AB which makes the player loop the audio/video playback between points A and B (A is in B's past). When the player reaches point B, playback freezes and then restarts at point A unless the user canceled the repeat AB state and reverted to normal playback.