20100107

Gentle local file indexing, please

Unlike websites, local file systems tend to give much better feedback on file changes. Still, most search-solutions use considerable I/O, something that is very annoying. Users are annoyed to the extent that they completely uninstall the search and indexing altogether. - I've done that with google, microsoft and other desktop search tools too.

Still I know that there is a difficult balance to all this. Today there is good OS support for file events. I recently read this post about using .NET API:s to monitor changes in file systems. There are also Linux versions as INotify or the kernel deamon auditd to do the same by listening to kernel events. The manual OS-independent method is to watch the modified time stamp for changes on all folders. Worst case is to have to scan the entire folder tree for changes, as if virus-scans where not annoying alone.

The event monitoring solutions work as long as they are on, but changes go unnoticed while the listening agent are off, and they need to fall back to scan for changes mode if they are switched on again - costing considerable annoying IO activity. Then comes the indexing that really pick up the IO...

Solutions:
* Late indexing and grouped indexing before searching, just log changes until then.
* While idle, create a low priority process for indexing groups of changed/new documents in one phase
* Push and convert documents by type to be nice to I/O by i.e. converting all doc files to XML at once.
* Push indexing to a remote server to reduce the load.

Anything else to consider for desktop/enterprise search?
Jonas