
Implementation

In this section, I describe three central aspects of the implementation:

  • Message logging, as it is the central mechanism in Nabu
  • Execution of user requests, especially the optimizations that were needed to execute SPARQL queries in a reasonable time frame
  • Logging of multi-user chats, as it differs from the logging of one-to-one chats and requires some more effort to be mapped to the Nabu ontology.

Logging a Message

http://www-user.rhrk.uni-kl.de/~f_osterf/nabu/logmessage_seq.png

The sequence diagram above shows the logging of a one-to-one message.

To receive all packets sent on the server, Nabu registers a PacketInterceptor at the XMPP server. The server passes every processed XMPP packet to the interceptor in a relatively low-level fashion (as a global packet sequence, with duplicates and without any grouping or sorting), so Nabu can filter out relevant packets and log them.

The Interceptor takes the packet, does basic filtering (discarding irrelevant packet types), casts the packets to their subtypes (Message, Presence) and passes them to the LoggerImpl. LoggerImpl then checks whether the packet should be logged by reading the account settings of the message sender from the Archive. If message logging is enabled, it continues by checking whether the message is equal to the last message logged, to avoid duplicates. This is necessary as every one-to-one message is intercepted twice: when entering the server (coming from the sender) and when being forwarded to the message receiver. As packet IDs are optional, the Logger falls back on comparing message subject and body when no ID is set: if subject and body are equal, the message is considered a duplicate and discarded.
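
A condensed sketch of this flow in Java. The Packet and Message interfaces merely stand in for the server's packet classes, and loggingEnabled/store are illustrative stubs, so names and signatures here are assumptions, not Nabu's actual API:

import java.util.Objects;

// Minimal stand-ins for the server's packet classes:
interface Packet { String getID(); }

interface Message extends Packet {
    String getFrom();
    String getSubject();
    String getBody();
}

abstract class LoggingInterceptor {
    private Message lastLogged; // the most recently logged message

    // The server's real interceptor callback carries additional arguments
    // (session, direction, ...); this sketch keeps only the packet.
    void interceptPacket(Packet packet) {
        if (!(packet instanceof Message)) {
            return;                      // discard irrelevant packet types
        }
        Message m = (Message) packet;
        if (!loggingEnabled(m.getFrom())) {
            return;                      // sender has disabled logging
        }
        if (isDuplicate(m)) {
            return;                      // second interception of the same message
        }
        store(m);                        // convert to RDF and add to the model
        lastLogged = m;
    }

    // Every one-to-one message is intercepted twice: when entering the
    // server and when being forwarded to the receiver. Packet IDs are
    // optional, so the check falls back to comparing subject and body.
    private boolean isDuplicate(Message m) {
        if (lastLogged == null) {
            return false;
        }
        if (m.getID() != null && lastLogged.getID() != null) {
            return m.getID().equals(lastLogged.getID());
        }
        return Objects.equals(m.getSubject(), lastLogged.getSubject())
                && Objects.equals(m.getBody(), lastLogged.getBody());
    }

    abstract boolean loggingEnabled(String senderJid); // account settings lookup
    abstract void store(Message m);                    // RDF conversion and storage
}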

If the message passes the duplicate check, the Logger converts it to RDF (using Jena2Java) and stores it in the model. The currently active privacy policy is attached by adding the statement (message, nabu:hasPolicy, activePolicy) to the internal model.
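
With Jena, attaching the policy amounts to adding a single statement to the internal model. A minimal sketch; the namespace URI is a placeholder, not Nabu's actual one:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

class PolicyAttacher {
    // Adds (message, nabu:hasPolicy, activePolicy) to the internal model.
    void attachPolicy(Model internalModel, Resource message, Resource activePolicy) {
        Property hasPolicy = internalModel.createProperty(
                "http://example.org/nabu#", "hasPolicy"); // placeholder namespace
        internalModel.add(message, hasPolicy, activePolicy);
    }
}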

User Requests

The processing and execution of user requests is separated into several steps. One important design goal was to keep request representation, parsing and execution separate from each other. That means the protocol used to talk to the Nabu bot is encapsulated and decoupled from the Nabu core. This makes it easy to implement protocols other than XMPP for archive access, such as XML-RPC, by implementing a few interfaces.

When the user sends a request, the Bot receives the request string in an XMPP message. The bot passes the request string to its implementation of the RequestParser interface. The parser returns the corresponding Request object, which is an abstraction from the protocol syntax. The bot then hands the request to the RequestExecutor implementation that was passed to the bot at creation time. The request executor executes the request, having full access to the RDF store, creates a corresponding response object and returns it to the bot. The bot finally uses its ResponseEncoder implementation to convert the response into a string and sends it back to the requestor.
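
A sketch of how the pieces fit together. The interface names are taken from the text above; the method signatures and the stub types are assumptions for illustration only:

// Stubs standing in for Nabu's abstractions, only to keep the sketch
// self-contained:
class Request {}
class Response {}
class Account {}
class ParseException extends Exception {}

interface RequestParser {
    // Parses a protocol-specific request string into an abstract Request.
    Request parse(String requestString) throws ParseException;
}

interface RequestExecutor {
    // Executes an abstract Request with full access to the RDF store.
    Response execute(Request request, Account requestor);
}

interface ResponseEncoder {
    // Converts an abstract Response back into a protocol-specific string.
    String encode(Response response);
}

class Bot {
    private final RequestParser parser;
    private final RequestExecutor executor;
    private final ResponseEncoder encoder;

    // The parser, executor and encoder are passed in at creation time.
    Bot(RequestParser parser, RequestExecutor executor, ResponseEncoder encoder) {
        this.parser = parser;
        this.executor = executor;
        this.encoder = encoder;
    }

    // Handling an incoming request string then looks roughly like this:
    String handleRequest(String requestString, Account requestor) throws ParseException {
        Request request = parser.parse(requestString);     // syntax -> abstraction
        Response response = executor.execute(request, requestor);
        return encoder.encode(response);                   // abstraction -> syntax
    }
}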

To ensure privacy, it is important that every interface receiving user requests verifies the identity of the requestor. We get that for free when using XMPP (as the user is identified and authenticated when sending the request to the bot), but other implementations have to take care of this themselves; otherwise the privacy model would be compromised.

Executing Queries

Request execution is straightforward and fast for most request types, so I won't go into detail for each and every one. More thought was needed for the implementation of SPARQL, executed using the QUERY request. The SPARQL interface must cope with two requirements:

  • It must respect the privacy model. This includes the removal of non-readable resources from query results ("post-filtering"), but there are also more subtle cases that cannot be addressed by simple post-processing of query results: a query like "Show me all accounts that exchanged messages in the last 7 days" returns a list of account pairs. Account resources are globally visible, so post-processing would leave them in the result even if the corresponding messages are filtered out. This is not sufficient: not only must the actual message content be hidden, but also the information that the two users exchanged messages at all. So non-accessible resources must be filtered out before query execution.
  • Query execution must scale up to large archives. Considering that the archive size increases linearly over time, and a Nabu installation could log messages over years, archives can grow quite large. Nabu should return query results in a reasonable time frame even when run on such large archives. Nabu must also scale in terms of memory consumption.

The privacy model requires the CBDs of privacy-protected resources to be hidden from unauthorized users. When working on a global RDF graph as Nabu does, this means that the graph must be pre-processed for each query to create the individual "view" of the requesting user, removing the statements he is not authorized to access. I decided on this pre-processing instead of, e.g., managing an individual model for every user: duplicating huge parts of the model would be error-prone (consistency problems) and would make basic actions like logging a message or changing a policy expensive.

The pre-processing algorithm, sketched in Java against the Jena API (Account, Policy, getPolicy and cbd stand for Nabu's types and helpers; hasPolicy is the internal policy property):

Model filterModel(Model m, Account account) {
    Model toRemove = ModelFactory.createDefaultModel(); // empty in-memory model
    // iterate over all (resource, internal:hasPolicy, policy) statements
    StmtIterator it = archive.getInternalModel()
            .listStatements(null, hasPolicy, (RDFNode) null);
    while (it.hasNext()) {
        Statement stmt = it.nextStatement();
        Policy policy = getPolicy(stmt.getObject());
        if (policy.deniesAccess(account)) {
            toRemove.add(cbd(stmt.getSubject())); // add CBD of non-accessible resource
        }
    }
    return m.difference(toRemove); // the main model without the statements in toRemove
}

There are two major bottlenecks in this algorithm that make a naive implementation slow: pre-processing a small archive containing 10,000 statements took about 40 seconds, which is of course unacceptable. The first bottleneck is that checking whether a policy allows access is expensive; doing this check for every protected resource creates an enormous overhead. Fortunately, this can be solved quite easily: as the number of policies is very small compared to the number of protected resources (usually a user will have only a couple of policies), caching the deniesAccess result reduced the time needed for filtering by about 50%.
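
A minimal sketch of such a cache, assuming a policy's decision does not change during a single filtering pass. Policy and Account stand for Nabu's types; the class name and signatures are illustrative:

import java.util.HashMap;
import java.util.Map;

// Stubs standing in for Nabu's types, only to keep the sketch self-contained:
interface Policy { boolean deniesAccess(Account account); }
class Account {}

// Caches each policy's access decision for one filtering pass, so a policy
// is evaluated once instead of once per protected resource.
class PolicyDecisionCache {
    private final Map<Policy, Boolean> decisions = new HashMap<Policy, Boolean>();
    private final Account account;

    PolicyDecisionCache(Account account) {
        this.account = account;
    }

    boolean deniesAccess(Policy policy) {
        Boolean denied = decisions.get(policy);
        if (denied == null) {
            denied = policy.deniesAccess(account); // evaluate only once per policy
            decisions.put(policy, denied);
        }
        return denied;
    }
}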

The second bottleneck is the creation of the CBDs. This was addressed by using a global CBD cache: the CBDCache class caches requested CBDs. As CBDs may change when statements are added to or removed from the model, the cache registers as a listener at the model and invalidates the cache entries of a statement's subject and object when receiving an addedStatement or removedStatement event. Together, these two optimizations reduced the time needed for filtering by about 95 percent.
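
A sketch of the invalidation logic, built here on Jena's StatementListener convenience class; the CBD computation itself is omitted:

import java.util.HashMap;
import java.util.Map;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.listeners.StatementListener;

abstract class CBDCache extends StatementListener {
    private final Map<Resource, Model> cache = new HashMap<Resource, Model>();

    CBDCache(Model model) {
        model.register(this); // be notified of all model changes
    }

    Model getCBD(Resource resource) {
        Model cbd = cache.get(resource);
        if (cbd == null) {
            cbd = computeCBD(resource); // expensive, hence the cache
            cache.put(resource, cbd);
        }
        return cbd;
    }

    // A changed statement may alter the CBDs of its subject and object.
    public void addedStatement(Statement s) { invalidate(s); }
    public void removedStatement(Statement s) { invalidate(s); }

    private void invalidate(Statement s) {
        cache.remove(s.getSubject());
        RDFNode object = s.getObject();
        if (object instanceof Resource) {
            cache.remove(object);
        }
    }

    abstract Model computeCBD(Resource resource); // omitted here
}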

Another approach tested was caching the filtered models: when a user starts a query, the filtered model is added to a (user, model) map. If the same user starts another query, Nabu checks whether there is an entry in the map and, if so, reuses the model. The advantage of model caching is that the cost of filtering drops to zero on a cache hit, whereas filtering, even with CBD caching, remains an O(n) operation. The disadvantages:

  • The cached model is not in sync with the main model, so user queries would operate on potentially outdated data. Expiring cache entries using a time limit decreases the hit ratio, and syncing all cached models on addedStatement/removedStatement is far more expensive than maintaining a CBD cache.
  • On a cache miss, the full time for model filtering is needed. This makes a user's first query very slow and the speed of successive queries unpredictable (a query may take, e.g., either 1 second or 1 minute, which could confuse users).
  • Memory consumption: for a main model of size m, a cache with n entries needs O(n*m) memory.

Model caching is implemented in ModelCache, but not activated, in favour of CBD caching. A combination of CBD and model caching would be possible, though: it would combine zero cost on a model cache hit with lower filtering costs on a cache miss.
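
A sketch of the (user, model) map described above; Account and the filterModel callback stand in for Nabu's types and methods:

import java.util.HashMap;
import java.util.Map;

import com.hp.hpl.jena.rdf.model.Model;

// Stub standing in for Nabu's account type:
class Account {}

abstract class ModelCache {
    private final Map<Account, Model> cache = new HashMap<Account, Model>();

    // Zero filtering cost on a hit; the full O(n) filtering pass on a miss.
    Model getFilteredModel(Account account) {
        Model filtered = cache.get(account);
        if (filtered == null) {
            filtered = filterModel(account); // the expensive pre-processing
            cache.put(account, filtered);
        }
        return filtered;
    }

    // The actual filtering (see the algorithm above).
    abstract Model filterModel(Account account);
}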

Multi-user Chat Support

Jabber was, like many other instant messaging protocols, primarily designed for one-to-one chat between two users. The core of the Jabber protocol therefore does not support multi-user chat (MUC for short); MUC support is instead provided by a protocol extension, defined in Jabber Enhancement Proposal (JEP) 0045. The concepts used in group chat as defined in JEP 0045 differ from those of one-to-one chat: where one-to-one chat uses the typical instant messaging concepts, MUC is similar to chat systems such as IRC. Nabu does not support the full feature set of JEP 0045, but only its most important part: the logging of text messages sent in chat rooms.

Ideally, there would be no strict separation between one-to-one and MUC chat: it should be possible to switch from a two-user to an n-user chat by simply adding more persons to the existing chat, making two-user chat just a special case of an n-user chat. The Nabu ontology uses this concept and abstracts from the details of the MUC protocol as far as possible: a MUC message is simply a message with multiple receivers instead of one, and has a MUCRoom linked as its "location". This higher level of abstraction from the protocol makes the ontology simpler, but the implementation in Nabu a bit more complex.

Before explaining how MUC logging is implemented, let's first look at how MUC works in principle. The central events relevant for message logging are joining a room, sending messages, and leaving the room:

Joining a Room

When Alice wants to join the room support@conference.jabber.foo.org, she first sends a presence packet to it:

<presence
    from='alice@jabber.foo.org/Psi'
    to='support@conference.jabber.foo.org/SuperAlice'>
  <x xmlns='http://jabber.org/protocol/muc'/>
</presence>

The "to" attribute contains the JID of the MUC room plus the desired nick for that room. When entering a MUC room, every user must choose a nick. This nick is used to identify him inside this room. The "real" Jabber ID "alice@…" is not necessarily visible to other room occupants. So Alice chooses the (admittedly not very creative) nick "SuperAlice".

Sending Messages

After entering the room, Alice might want to send a (broadcast) message to the other occupants. This is done by sending a message from her nick to the room JID:

<message
    from='support@conference.jabber.foo.org/SuperAlice'
    to='support@conference.jabber.foo.org'
    type='groupchat'>
  <body>Hello everybody!</body>
</message>

When receiving the message, the MUC component broadcasts the message to all room occupants:

<message
    from='support@conference.jabber.foo.org/SuperAlice'
    to='bob@jabber.foo.org'
    type='groupchat'>
  <body>Hello everybody!</body>
</message>
<message
    ...
    to='charlie@jabber.foo.org'>
  ...
</message>

Exiting a Room

The user exits the room by sending a presence packet with type="unavailable" to her nick JID:

<presence
    from='alice@jabber.foo.org/Psi'
    to='support@conference.jabber.foo.org/SuperAlice'
    type='unavailable'/>

Logging MUC Messages

To implement MUC support, there are three issues Nabu has to address:

Intercepting MUC messages. Nabu does this by registering a special interceptor for the MUC subdomain (usually "conference") at the XMPP server. All packets sent to or from the MUC domain are passed to the interceptor. To reduce complexity, MUCInterceptor creates a MUCRoomHandler for every room and passes messages sent in a room to the responsible handler.

Resolving room nicks to "real" Jabber IDs. Apart from using different room subclasses (MUCRoom/P2PRoom) for the room concept, Nabu abstracts completely from MUC details and stores one-to-one and MUC messages in the same manner. Thus Nabu does not store MUC nicks, but the actual Jabber IDs of the users. The example above, for instance, will be stored with alice@jabber.foo.org as nabu:sender and bob@jabber.foo.org and charlie@jabber.foo.org as nabu:receivers.

Keeping track of room members. To set sender and receivers correctly, Nabu must listen to join and exit events and maintain a list of the accounts currently occupying each room. This is very important, as wrong nabu:receivers links affect privacy: accounts incorrectly linked as receivers can read messages they never received.

Both the nick -> Jabber ID mapping and the occupant list are managed per room by the corresponding room handler.
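
A sketch of this per-room bookkeeping; the method signatures are illustrative, as the real handler works on the server's packet classes:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MUCRoomHandler {
    // nick (e.g. "SuperAlice") -> real JID (e.g. "alice@jabber.foo.org")
    private final Map<String, String> occupants = new HashMap<String, String>();

    // Presence to room@domain/nick announces a join; with
    // type='unavailable' it announces an exit.
    void handlePresence(String nick, String realJid, boolean unavailable) {
        if (unavailable) {
            occupants.remove(nick);       // user left the room
        } else {
            occupants.put(nick, realJid); // user joined under this nick
        }
    }

    // nabu:sender of a logged room message: the sender nick resolved
    // to the real Jabber ID.
    String resolveSender(String senderNick) {
        return occupants.get(senderNick);
    }

    // nabu:receivers: all current occupants except the sender.
    Collection<String> receivers(String senderNick) {
        List<String> result = new ArrayList<String>(occupants.values());
        result.remove(occupants.get(senderNick));
        return result;
    }
}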

