Partial row update

Submitted by gminet on Tue, 2007-12-04 01:57. General Discussions

Hi Will

Our project based on dbsight has been accepted and we have just started development. As the number and size of the documents will be quite large and retrieval for indexing will be costly (we'll fetch and decrypt documents live from an encrypted file vault), we'd like to anticipate performance problems as early as possible.

In the file vault, access rights to documents are strictly restricted based on user name. So, in order to avoid filtering dbsight result sets in the client application (fetched through a web service), we plan to include a metadata column containing the user lists (and maybe groups too if necessary), so we can specify the username in the query and avoid presenting documents to which the user would later be denied access (and also avoid showing confidential extracts/summaries to the wrong users). That would also greatly simplify the client application code compared with a "live filtering and repaging" step.

Does this sound like a good idea with dbsight?

However, that means that whenever the user list changes for a document, we must change the modification date in the master dbsight table (actually, a view) so the document can be reindexed. We thus fear that user additions/deletions in frequently used groups (an "almost everybody" group) would trigger a large index update and massive file vault access (as the documents would need to be decrypted again even if the documents themselves have not been modified).

So here is my question: is there a way or an applicable trick to update only part of the indexed rows (in fact I don't even know if Lucene itself supports this), and could this be used in dbsight? I'm thinking of solutions like:

- using multiple modification times in the main query and/or in subqueries?
- using multiple indexes, if we can run a query across indexes?
- returning a special constant in the document blob that would tell dbsight it can keep the previously indexed data for this row (this seems the easiest way to go)?
- any other idea or advice?

thank you
kind regards

Gaetan

Submitted by will on Tue, 2007-12-04 10:10.

Hi, Gaetan,

Filtering on the server side will definitely speed up data retrieval.

Your approach is to tag each document with a list of users, but the concern is that if one user is added to or deleted from a group, all of that group's documents will be affected.

In this case, you should tag each document with groups. Then, when a user searches for something, you retrieve the user's groups and use those groups to filter the search results.

This "group" filter can also work together with the "user" filter, for example "lq=(group:x or user:tom)".

The user's group can be read from another index. Or simply directly from the database.
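
As a rough illustration of the idea above, the filter string could be assembled on the client side before calling the search service. This is only a minimal Python sketch: the function name `build_access_filter` is made up, the field names `group` and `user` and the lowercase `or` simply mirror the example above, and the exact `lq` syntax dbsight accepts should be checked before relying on it.

```python
def build_access_filter(username, groups):
    """Build a filter such as (group:x or user:tom).

    Hypothetical helper: field names and the lowercase `or` follow the
    example in this thread, not a verified dbsight grammar.
    """
    clauses = ["group:%s" % g for g in groups]
    clauses.append("user:%s" % username)
    return "(" + " or ".join(clauses) + ")"


# The groups would be read from the database or from another index.
print(build_access_filter("tom", ["x"]))  # (group:x or user:tom)
```

Because only the group names are stored on each document, adding or removing a user from a group changes what this function receives, not what is in the index.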

Sounds good?

Will

Submitted by gminet on Wed, 2007-12-05 03:46.

Hi Will

Thank you for your reply.

Indeed, we have looked into the possibility of including groups in the index, but we have concerns with the current applications, as there might be "negative" permissions on documents, i.e.

user U1 is a member of G1, doc D1 can be read by groups G1 and G2 but not by U1 (per document exclusion, à la ntfs).

That negative permission could probably be indexed too in fact (users+, groups+, users- and groups- index) but that would duplicate the whole access logic in dbsight.
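
To make the four-field idea concrete, here is a minimal Python sketch of the check that such an index would have to reproduce. It assumes that an explicit exclusion always wins over a grant (as NTFS deny entries usually do); the `DocAcl` type and the `can_read` function are hypothetical names for illustration, not part of dbsight or of the file vault.

```python
from dataclasses import dataclass, field


@dataclass
class DocAcl:
    # Hypothetical per-document lists: users+, groups+, users-, groups-.
    users_allowed: set = field(default_factory=set)
    groups_allowed: set = field(default_factory=set)
    users_denied: set = field(default_factory=set)
    groups_denied: set = field(default_factory=set)


def can_read(acl, user, user_groups):
    """Return True if the user may read the document.

    Assumption: a negative permission overrides any positive one.
    """
    groups = set(user_groups)
    if user in acl.users_denied or groups & acl.groups_denied:
        return False
    return user in acl.users_allowed or bool(groups & acl.groups_allowed)


# Example from above: D1 is readable by G1 and G2, but not by U1.
d1 = DocAcl(groups_allowed={"G1", "G2"}, users_denied={"U1"})
print(can_read(d1, "U1", ["G1"]))  # False
print(can_read(d1, "U2", ["G1"]))  # True
```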

Also, a lot of other metadata could change (like the "file path" if a user renames a big directory, though not so often and not so much at a time), so it would help if we could avoid stress on the file vault in those cases too.

We'll dig into the group possibility a bit more, however; it may well be enough.

What do you mean by these sentences:

The user's group can be read from another index. Or simply directly from the database.
?

Does "another index" mean another field/row in the dbsight database ("group"), or a whole other dbsight index (cross-index query?).

Thanks a lot

Kind regards

Gaetan

NB: In case of performance problems, do you provide custom code adaptations in dbsight? That's something I could propose to the customer if it turns out to be necessary.

Submitted by will on Wed, 2007-12-05 16:23.

Partial updating is hard to do in Lucene because its index is inverted, meaning a single change has to be propagated to every search term.
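
A toy example may help show why. The Python sketch below is not how Lucene is implemented; it only illustrates that in an inverted index a document's content is spread over one posting list per term, so replacing a document means touching every term it contained. All function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def remove_document(index, doc_id):
    """Remove a document; return how many posting lists were touched."""
    touched = 0
    for term in list(index):  # iterate over a copy, since keys may be deleted
        if doc_id in index[term]:
            index[term].discard(doc_id)
            touched += 1
            if not index[term]:
                del index[term]
    return touched


docs = {1: "secret budget report", 2: "public budget summary"}
index = build_inverted_index(docs)
# Changing anything in document 1 means visiting each of its terms.
print(remove_document(index, 1))  # 3
```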

"another index" meaning a whole different dbsight index. You will need to issue a different http call for it.

"Customizing code" is possible, depending on the effort.

Submitted by gminet on Mon, 2007-12-10 02:06.

Hi Will

If I use another index and need to issue a separate HTTP request, I'll have to manually join the result sets in the client application, and possibly filter them, right? That would defeat the purpose of indexing the credentials.

Thanks for your help.

Gaetan

Submitted by will on Mon, 2007-12-10 09:16.

You can query the database or search some index to get a list of the user's groups and forbidden docs. Then you can issue a query like

 group:abc AND -doc:23434

It should be pretty quick.
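
As a minimal sketch of that second step, the query string could be built like this in Python. The function name `build_restricted_query` is hypothetical, the field names `group` and `doc` follow the example above, the uppercase `OR` for several groups is an assumption based on the `AND` used above, and values are assumed to be plain tokens that need no escaping.

```python
def build_restricted_query(groups, forbidden_doc_ids):
    """Build e.g. 'group:abc AND -doc:23434' from looked-up permissions.

    Hypothetical helper; the query syntax mirrors the example in this
    thread and is not a verified dbsight grammar.
    """
    if not groups:
        raise ValueError("at least one group is required")
    group_clauses = ["group:%s" % g for g in groups]
    if len(group_clauses) == 1:
        positive = group_clauses[0]
    else:
        positive = "(" + " OR ".join(group_clauses) + ")"
    negatives = ["-doc:%s" % d for d in forbidden_doc_ids]
    return " AND ".join([positive] + negatives)


# Groups and forbidden docs come from the database or another index.
print(build_restricted_query(["abc"], [23434]))  # group:abc AND -doc:23434
```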