code is data: 2010

20101113

Google ready to kill software patents?

Google on Oracle vs Google: "Each of the Patents-in-Suit is invalid under 35 U.S.C. § 101 because one or more claims are directed to abstract ideas or other non-statutory subject matter."
Kudos, Google! Rejecting software patents like this is the right thing to do for innovation!
More at Groklaw: http://www.groklaw.net/article.php?story=20101111114933605

20101101

Java finally adds NIO2.

Java 7 comes with NIO2, "New I/O version 2". It's a stupid name, I know, but it packs some extremely important functions.
These new functions will let us index faster, track changes in file systems, and read more file attributes such as users and groups.
I have been waiting for this since early 2002, when the proposal for NIO2 came. I had almost given up hope on Java since then.
This makes it possible to update some core I/O functions in corpus and in our public Java libraries for indexing.
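
To give an idea of what this looks like in practice, here is a minimal sketch of a tree walk that reads size, modified time and owner in one pass using the Java 7 `java.nio.file` API. The class name `AttributeWalk` and the method `describe` are made up for illustration and are not taken from corpus or our libraries. Group information is only available on POSIX file systems (through `PosixFileAttributes`), so the sketch sticks to the owner.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class AttributeWalk {

    /** Returns one tab-separated line per regular file: path, size, modified time, owner. */
    public static List<String> describe(Path root) throws IOException {
        final List<String> out = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                if (attrs.isRegularFile()) {
                    String owner;
                    try {
                        owner = Files.getOwner(file).getName();
                    } catch (UnsupportedOperationException e) {
                        owner = "?"; // file system has no notion of an owner
                    }
                    // The basic attributes arrive with the visit, so no extra stat call is made for them.
                    out.add(file + "\t" + attrs.size() + "\t" + attrs.lastModifiedTime() + "\t" + owner);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : describe(root)) {
            System.out.println(line);
        }
    }
}
```

The point of the walk is that the visitor receives the basic attributes together with each file, which is where the indexing speed-up is expected to come from.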

20100107

Gentle local file indexing, please

Unlike websites, local file systems tend to give much better feedback on file changes. Still, most search solutions use considerable I/O, which is very annoying. Users get annoyed to the point that they uninstall the search and indexing tool altogether. I've done that with Google, Microsoft and other desktop search tools too.

Still, I know that there is a difficult balance to all this. Today there is good OS support for file events. I recently read this post about using .NET APIs to monitor changes in file systems. On Linux there are equivalents such as inotify or the kernel daemon auditd, which do the same by listening to kernel events. The manual, OS-independent method is to watch the modified time stamp for changes on all folders. The worst case is having to scan the entire folder tree for changes, as if virus scans were not annoying enough on their own.
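
The manual method can be sketched as two snapshots of modified time stamps that are compared afterwards. This is only a rough illustration in Java 7: the names `MtimeScan`, `snapshot` and `changed` are invented here, and the sketch does the full worst-case scan of every file rather than only looking at folders.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class MtimeScan {

    /** Maps every regular file under root to its last-modified time in milliseconds. */
    public static Map<Path, Long> snapshot(Path root) throws IOException {
        final Map<Path, Long> seen = new HashMap<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    seen.put(file, attrs.lastModifiedTime().toMillis());
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return seen;
    }

    /** Files that are new, or whose time stamp differs from the older snapshot. */
    public static Set<Path> changed(Map<Path, Long> before, Map<Path, Long> after) {
        Set<Path> out = new TreeSet<>();
        for (Map.Entry<Path, Long> e : after.entrySet()) {
            Long old = before.get(e.getKey());
            if (old == null || !old.equals(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Deleted files would be the keys found in `before` but not in `after`. Every call to `snapshot` touches the whole tree, which is exactly the I/O cost complained about above.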

The event monitoring solutions work as long as they are on, but changes go unnoticed while the listening agent is off, and it has to fall back to scanning for changes when it is switched on again, which costs considerable and annoying I/O activity. Then comes the indexing, which really drives up the I/O...
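
For the listening part, a small sketch with the Java 7 `WatchService` could look like the following. The names `ChangeListener`, `watch` and `pollChanges` are hypothetical, and the sketch watches a single folder only; sub-folders would each have to be registered separately. It does nothing about the downtime problem: anything that happened before `watch` was called still needs a fallback scan.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only.
class ChangeListener {

    /** Registers dir (not its sub-folders) for create, modify and delete events. */
    public static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    /** Waits up to timeoutSeconds for events and returns the affected paths (may repeat). */
    public static List<Path> pollChanges(WatchService watcher, Path dir, long timeoutSeconds)
            throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return changed; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            WatchEvent.Kind<?> kind = event.kind();
            if (kind == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; a rescan would be needed here
            }
            // For the ENTRY_* kinds the context is the file name relative to the watched folder.
            changed.add(dir.resolve((Path) event.context()));
        }
        key.reset(); // re-arm the key so later events are delivered
        return changed;
    }
}
```

How quickly events arrive depends on the platform, since some JVMs fall back to polling where the OS offers no native notification.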

Solutions:
* Late indexing and grouped indexing before searching; just log changes until then.
* While idle, create a low-priority process that indexes groups of changed/new documents in one phase.
* Push and convert documents by type to be nice to I/O, e.g. by converting all doc files to XML at once.
* Push indexing to a remote server to reduce the load.
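
The first and third points can be sketched together as a change log that only remembers paths and later hands them out grouped by file type. This is a toy illustration under my own assumptions: the class `ChangeLog` and its methods are invented names, the log lives in memory only, and "type" is simply the lower-cased file extension.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class ChangeLog {

    // Paths waiting to be indexed; a set, so repeated changes to one file count once.
    private final Set<String> pending = new LinkedHashSet<>();

    /** Records a changed path. Cheap: no conversion or indexing I/O happens here. */
    public void record(String path) {
        pending.add(path);
    }

    /** Empties the log and returns the paths grouped by lower-cased extension. */
    public Map<String, List<String>> drainByType() {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String p : pending) {
            String name = p.substring(Math.max(p.lastIndexOf('/'), p.lastIndexOf('\\')) + 1);
            int dot = name.lastIndexOf('.');
            // Names without a dot, or starting with one, end up in the "" group.
            String ext = dot > 0 ? name.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
            List<String> group = groups.get(ext);
            if (group == null) {
                group = new ArrayList<>();
                groups.put(ext, group);
            }
            group.add(p);
        }
        pending.clear();
        return groups;
    }
}
```

An idle-time or pre-search job would call `drainByType()` and hand each group to the converter for that type in one go, so all doc files are processed in a single batch.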

Anything else to consider for desktop/enterprise search?
Jonas