Search outside of database: facing setup issues

Submitted by itsvasu on Wed, 2010-12-01 22:06. General Discussions

I need to use the "Search outside of Database" implementation for my requirement. I have a table with id, file_path, and created_on columns.

I thought of implementing the following DumbFetcher:

public class DumbFetcher extends AbstractFetcher {

    @Override
    public void execute(Properties p, long lastRunTime, List<Column> columns) {
        // 1. Execute: select id, file_path, created_on from files_list
        // 2. For each record:
        // 3.   <<FileTypeFilter>> ft = new <<FileTypeFilter>>()
        // 4.   ft.add("id", id);
        // 5.   ft.add("file_content", new <<Custom_API_to_get_String>>(file_path).toString());
        // 6.   ft.add("created_on", created_on);
        // 7.   scheduleDocument(ft);
    }

    @Override
    public List<FieldType> getFieldTypes(Properties p) {
        List<FieldType> ret = new ArrayList<FieldType>();
        ret.add(new NumberFieldType("id").setPrimaryKey(true));
        ret.add(new StringFieldType("file_content"));
        ret.add(new TimestampFieldType("created_on").setModifiedTime(true));
        return ret;
    }
}

1. Is the above implementation correct?

2. I couldn't understand the purpose of the List<Column> columns argument to the execute method. Please explain, or point me to a reference document.

3. Should "Code line 3" use a <<FileTypeFilter>> (for example DocTypeFilter, CvsTypeFilter, etc.), or should it always use TextDocument? How do I integrate new file type filters into the DumbFetcher?

4. Should "Code line 5" always pass the content of the file, or does it also accept a FileTypeFilter?

5. How are failures handled? That is, if a document fails indexing, how do I reindex it? Is this taken care of by DBSight/Lucene, or should my DumbFetcher implement its own failover mechanism?

6. How do I inform Lucene that a document has been deleted from the source?

7. I am facing a few problems while configuring the index through the admin UI. Even after selecting CustomFetcher, the index configuration screens keep asking for the main query and subquery. I feel "Search outside DB" shouldn't ask for this configuration. Do you have any screencasts on how to set up search outside the DB?

8. Why should the primary key be of keyword type? I don't want my primary key to be searchable, since it is an internal unique number and the user shouldn't know about it. As per the description, keyword fields are also included in the search index.

9. The last step gives the following message: "Scheduling indexing is not included in Free License Level. Please upgrade to allow DBSight to work for you". How do I configure the custom fetcher and go ahead with indexing and search?

Thanks for your time.

Submitted by will on Thu, 2010-12-02 12:47.

Thanks for all the questions! I will compile your questions and the following answers into a wiki page.

1. This is good. Additionally, you can use lastRunTime if you want incremental indexing.

2. The list of columns contains the selected columns. The configuration UI allows you to choose a subset of columns instead of the full list.

3. Only TextDocument for now. Different document types are not currently supported, but it's not a bad idea to support them.

4. No, not for now. I think you can use a library such as POI to process each specific file type.

5. Failed cases are not handled by DBSight.

6. Deletion is not handled here. You can use a REST call to delete the document from DBSight directly.

7. It should not ask for these. Please let me know your exact problem.

8. The primary key is a Keyword so that duplicate documents can be looked up. You can configure the primary key not to be searchable.

9. You can use the Dashboard to start indexing.
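
To illustrate answer 1, here is a rough sketch of how lastRunTime could narrow the query for incremental indexing. IncrementalQuery is a made-up helper, not part of DBSight, and it assumes that lastRunTime is milliseconds since the epoch and that a value of zero or less means "no previous run". The table and column names are the ones from the original post.

```java
import java.sql.Timestamp;

/** Hypothetical helper, not part of DBSight. */
final class IncrementalQuery {
    // Query from the original post.
    static final String FULL_QUERY =
            "select id, file_path, created_on from files_list";

    /** First run (lastRunTime <= 0) selects everything; later runs only newer rows. */
    static String sqlFor(long lastRunTime) {
        return lastRunTime <= 0 ? FULL_QUERY : FULL_QUERY + " where created_on > ?";
    }

    /** Assumption: lastRunTime is milliseconds since the epoch. */
    static Timestamp cutoff(long lastRunTime) {
        return new Timestamp(lastRunTime);
    }
}
```

Inside execute you would prepare sqlFor(lastRunTime) and, when it contains the placeholder, bind cutoff(lastRunTime) as its parameter. Note that filtering on created_on only picks up new rows; rows changed in place would need a modification timestamp column instead.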
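
For answer 4, since the document field has to receive plain text, one possible approach is a small registry that picks a text extractor by file extension. This is only a sketch: FileTextExtractor and its Extractor interface are hypothetical names, and only plain .txt files are handled out of the box. An extractor for Word or Excel files (for example one backed by POI) would be registered the same way.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of DBSight. */
final class FileTextExtractor {
    /** Hypothetical callback type: turns one file into plain text. */
    interface Extractor {
        String extract(Path file) throws IOException;
    }

    private final Map<String, Extractor> byExtension = new HashMap<>();

    FileTextExtractor() {
        // Plain text needs no extra library; UTF-8 is an assumption.
        register("txt", file -> new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    /** Registers an extractor for a file extension given without the dot. */
    void register(String extension, Extractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the file's text, or an empty string when no extractor is registered. */
    String extract(Path file) throws IOException {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        Extractor extractor = byExtension.get(ext);
        return extractor == null ? "" : extractor.extract(file);
    }
}
```

The string returned by extract is what would be passed as the file content in "Code line 5".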
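
For answer 5, because failed cases are left to the fetcher, one option is to remember the ids that failed and try them again on the next run. The sketch below is an assumption about how that could look, not something DBSight provides: RetryingIndexer and RecordIndexer are made-up names, and the failed ids are kept in memory only, so a real fetcher would probably need to persist them somewhere.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of DBSight. */
final class RetryingIndexer {
    /** Hypothetical hook: index one record, throwing on failure. */
    interface RecordIndexer {
        void index(long id) throws Exception;
    }

    // Ids that failed in the previous run, in the order they failed.
    private final Set<Long> failedIds = new LinkedHashSet<>();

    /** Retries earlier failures first, then the new ids; returns the ids that failed this run. */
    Set<Long> run(List<Long> newIds, RecordIndexer indexer) {
        Set<Long> work = new LinkedHashSet<>(failedIds);
        work.addAll(newIds);
        failedIds.clear();
        for (Long id : work) {
            try {
                indexer.index(id);
            } catch (Exception e) {
                failedIds.add(id); // keep for the next run
            }
        }
        return new LinkedHashSet<>(failedIds);
    }
}
```

In the DumbFetcher, the indexer callback would wrap steps 3 to 7 of the execute method for a single record.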

Submitted by will on Thu, 2010-12-02 17:58.

6. Deletion is handled via the scheduleDeleteDocument(TextDocument doc) function. The doc only needs to contain the primary key column value.
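
Before calling scheduleDeleteDocument, the fetcher needs to know which primary keys have disappeared from the source. A minimal sketch of that step, assuming the fetcher keeps its own record of previously indexed ids (DeletedIdFinder is a hypothetical helper, not part of DBSight):

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of DBSight. */
final class DeletedIdFinder {
    /** Returns ids that were indexed before but are no longer in the source, in ascending order. */
    static Set<Long> deletedIds(Collection<Long> previouslyIndexed,
                                Collection<Long> currentInSource) {
        Set<Long> deleted = new TreeSet<>(previouslyIndexed);
        deleted.removeAll(currentInSource);
        return deleted;
    }
}
```

Each returned id would then go into a TextDocument holding just the primary key and be passed to scheduleDeleteDocument. How the TextDocument is built is not shown in this thread, so that part is left out here.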