The common approach to implementing Lucene search is to update the Lucene index right after content is submitted. It works fine on a single server with limited volume.
But suppose your site runs on several servers, all generating new content for search. In that case it is difficult to implement Lucene search. Well, you can create a Lucene index on each server, merge them together (?), and distribute the merged index back to each server...
And if the volume is high, merging Lucene indexes becomes painfully slow. Merging indexes is intrinsically a slow process: it has to go through all the documents you have.
This common approach doesn't scale, by either the number of servers or the number of documents.
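To make the merge cost concrete, here is a toy sketch in Python. It is not Lucene's actual merge code; `build_index` and `merge_indexes` are hypothetical helpers, and the sketch assumes document ids are unique across servers. The point is only that a merge has to touch every posting in every per-server index, not just the newly added documents.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> sorted list of document ids."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return dict(index)

def merge_indexes(indexes):
    """Merge per-server indexes and count how many postings were touched."""
    merged = defaultdict(list)
    touched = 0
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term].extend(doc_ids)
            touched += len(doc_ids)  # every existing posting is copied again
    for doc_ids in merged.values():
        doc_ids.sort()
    return dict(merged), touched

# Two hypothetical servers, each with its own small index.
server_a = build_index({1: "lucene search", 2: "database search"})
server_b = build_index({3: "lucene index", 4: "merge index"})

merged, touched = merge_indexes([server_a, server_b])
print(merged["lucene"], touched)
```

In this toy model, adding one document on one server and merging again repeats the whole pass, which is why the cost grows with the total number of documents rather than with the size of the update.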
This is the email that we sent to Joe Ottinger, site editor of TheServerSide.com, showing him why DBSight is good and useful.
Glad to be able to talk with you. I have come to realize that you are also active on the Lucene mailing list, and that you are also working on LuceneRAR, right?
Since you know Lucene well, I can go into more detail. I am really passionate about DBSight, so please bear with me. The demo is here: http://search.dbsight.com. A step-by-step tutorial on how this demo was created is here: http://wiki.dbsight.com/index.php?title=Step_by_step.
DBSight is very simple to use, yet very powerful inside.
Anyone who has taken a peek at the demo search has probably noticed the two Narrow By (navigation) boxes, which are not seen in typical web search engines like Yahoo! or Google. This is one of my favorite features of DBSight, because it adds a flavor of database query to the search, allowing users to filter or count the search results by certain columns (year and genre in the demo search). This feature is also unique to structured data search tools, so you won't see it
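As a rough illustration of what a Narrow By box computes, here is a minimal Python sketch. It is not DBSight's implementation; the `ALBUMS` rows and the `search` and `narrow_by` helpers are made up for this example. The idea is to count the current hits per column value, then filter on the value the user clicks.

```python
from collections import Counter

# Hypothetical rows with the two columns used in the demo search.
ALBUMS = [
    {"title": "Blue Morning", "year": 1998, "genre": "jazz"},
    {"title": "Deep Blue Night", "year": 2001, "genre": "jazz"},
    {"title": "Blue River", "year": 2001, "genre": "folk"},
    {"title": "Red Harvest", "year": 1998, "genre": "rock"},
]

def search(rows, keyword, filters=None):
    """Keyword match on title, optionally narrowed by exact column values."""
    keyword = keyword.lower()
    hits = [row for row in rows if keyword in row["title"].lower()]
    for column, value in (filters or {}).items():
        hits = [row for row in hits if row[column] == value]
    return hits

def narrow_by(hits, column):
    """Count the hits per distinct value of one column."""
    return Counter(row[column] for row in hits)

hits = search(ALBUMS, "blue")
genre_counts = narrow_by(hits, "genre")   # what the Narrow By box would list
year_counts = narrow_by(hits, "year")
jazz_hits = search(ALBUMS, "blue", {"genre": "jazz"})  # after clicking "jazz"
print(genre_counts, year_counts, len(jazz_hits))
```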
SF.net has been doing the same thing DBSight does, except that DBSight is an off-the-shelf product.
Thanks to Chris Conrad! He has written an excellent article on the details of SF.net's search project. (I guess it was me who requested this on the Lucene mailing list.) At least we now know what a big search project usually looks like.
As you may know, DBSight is basically Lucene-on-database. Best of all, I've found that DBSight already meets most of their requirements. And it's off-the-shelf!
For IT managers, there may be one or several databases lying around. They want to search those databases, but the existing search annoyingly keeps returning "No Results Found". The "Advanced Search", if there is one, is often complicated and hard to use, and it is often slow and resource-hungry.
Why not create a search as simple and elegant as Google? Well, your reply may be, "I don't think I can do that".
Actually you can!
With DBSight, you surely can create a search engine on your database, and do it yourself!
DBSight is a free-to-download J2EE application. It
- Has a scheduler to crawl database updates by JDBC
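To give an idea of what crawling database updates can look like, here is a minimal Python sketch. It is not DBSight's actual crawler: it uses Python's built-in `sqlite3` as a stand-in for a JDBC connection, and the `articles` table, its `modified` column, and the `fetch_updates` helper are assumptions made for this example. A scheduler would simply call `fetch_updates` at a fixed interval.

```python
import sqlite3

def fetch_updates(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, title, modified FROM articles "
        "WHERE modified > ? ORDER BY modified",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# In-memory database standing in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, modified INTEGER)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "first", 100), (2, "second", 200)],
)

index = {}  # toy "search index": id -> title
mark = 0

# First scheduled run: everything is new.
first_run, mark = fetch_updates(conn, mark)
for row_id, title, _ in first_run:
    index[row_id] = title

# A row changes, and the next scheduled run picks up only that row.
conn.execute("UPDATE articles SET title = 'second, edited', modified = 300 WHERE id = 2")
second_run, mark = fetch_updates(conn, mark)
for row_id, title, _ in second_run:
    index[row_id] = title
print(len(first_run), len(second_run), mark)
```

Keeping a high-water mark means each run reads only the rows that changed since the previous run, instead of re-reading the whole table.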
No doubt SQL Server 2005 has significant improvements in full-text search, as Microsoft's own benchmarks will tell. But to do the search, a developer needs to translate the user's query into well-formed SQL queries, written in a vendor-specific dialect. And the full-text search has to run inside SQL Server, on the same computer.
So if your database volume goes up and you need to upgrade, you have to buy more licenses for a higher price, since you are bound to SQL Server. And you have to buy a more powerful computer, probably a high-end mainframe, for more CPU and memory resources.
- DBSight supports all JDBC-capable databases, including older versions of SQL Server that Microsoft ditched.
Search relevance determines the quality of a search engine. An ideal search engine should return a list of search results in which the top-ranked one is the most reliable and useful result for its target audience.
Google's proprietary technology PageRank determines relevance largely by popularity: "In essence, Google interprets a link from page A to page B as a vote, by page A, for page B." The more weighted votes a page gets, the more popular it becomes to Google.
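The "link as a weighted vote" idea can be sketched in a few lines of Python. This is the simplified textbook form of PageRank, not Google's production ranking; the damping factor of 0.85, the iteration count, and the four-page `LINKS` graph are assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its rank among the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A vote from a highly ranked page carries more weight.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to (votes for).
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B", "C"]}

ranks = pagerank(LINKS)
most_popular = max(ranks, key=ranks.get)
print(most_popular)
```

Page B collects votes from A, C, and D, so it ends up with the highest rank in this small graph.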
For structured data, such as a database that DBSight searches, PageRank no longer applies because there is no such thing as a link structure. So how do we "calculate" a document's popularity to help improve search relevance?
With help from freedb.org, I posted our demo search on freedb.org's news page.
Currently, it's on the front page of http://www.freedb.org
I have just completed the tutorial on how to create a full-text search on freedb data.
The data quality is not ideal, since there are many duplicate entries and erroneous inputs. I am going to clean it up later when I am free.
My goal is to show how easy it is to create a full-text search on your database.
The tutorial is here: http://www.dbsight.net/mediawiki/index.php?title=Step_by_step
The search is here: http://search.dbsight.com
There is a good article on Enterprise Search.
And another one: Why Writing Your Own Search Engine is Hard
Some marketing for DBSight:
Our aim is to make it super easy for administrators to create a full-text search on any relational database, with better performance, while staying cheap.