Dec 2, 2011

How should code search work?

Hey all. As you probably all know, this is the Code Recommenders project blog, which we typically use to announce new features, releases or other noteworthy topics related to Code Recommenders. This blog post is a bit different from the others: it's our first "guest blog post", written by Tobias Boehm, a master's student writing his master's thesis within the Code Recommenders project.
This post is basically a brain dump of how he thinks code search engines like Google Code Search, Krugle, or Koders *should* work and how we should be able to use them. He introduces an early prototype of a code search query language (which he will implement using Xtext) and a client tightly integrated into the Eclipse IDE. His work will be based on the previous code search engine and Eclipse client we already blogged about a few months ago here: "Why is Google Codesearch not google for code search?"


This blog post is a "heads-up! Your feedback is wanted!" post. So please do not hesitate to ask tough questions or provide any other kind of feedback. All kinds of feedback are appreciated and will help Tobias catch idea bugs early! If you are interested in joining the work on code search engines, get in contact via the Code Recommenders forum. We are looking forward to your feedback.


Thanks,
Marcel



How should code search work?
We all know that learning an API is hard. But we do it day by day by day... When learning a new framework we often ask things like "Which classes should I extend?", "Which methods should I invoke on this object?", "How can I create an instance of this particular type?", "What does other people's code that is similar to mine look like?". Hopefully, some documentation is available that answers these kinds of questions. But how many frameworks do you know that have such excellent documentation? If you know a few: how many do you know that do not?

Let's assume that there is some documentation somewhere. Who wants to work through heaps of documents just to know how to obtain an instance of the famous IStatusLineManager? In my opinion, the best possible documentation already exists. It is the existing, tested code that is available in large quantities in code repositories, where these APIs are actually used, the classes are extended, the methods called and the objects actually instantiated. The code is there, it just has to be found!

But searching for source code is still a tedious task. Although many search engines exist, and even Google has a product targeted at code search, they are not too useful in certain situations. First, it seems most of them treat source code the way web search engines treat the websites they index: as plain text. While that might make sense for some code search use cases, it is not enough for most others. Moreover, they are too generic to be useful. Most code search engines seem to identify the programming language the code is written in, yet they do not use the language-specific semantics that lie underneath.

And then there is "availability". For a developer to be able to search for source code, this source code has to be indexed by the search engine. So the code available is always restricted to open source code that is publicly available on the web, while all personal and company repositories remain unused.

Lastly, there is no IDE integration. That means that every time we want to issue a query, we have to shift focus away from our IDE to a website.

In this blog post, I describe my plans on how to implement a code search engine with Apache Lucene. I'll go through a set of sample queries and explain what gets indexed and how developers can query the index to solve common day-to-day tasks.

If you are interested in code search and are maybe looking for a good alternative to the (almost closed) Google Code Search, or want to build a code search engine for your own company, please continue reading and don't hesitate to ask questions about it here or in the Code Recommenders forum.


Query
As said above, the query capabilities of today's code search engines are somewhat limited. Code Recommenders might come to the rescue here. The heart of this prototype, which is currently in development, is a novel query language. This query language must be very simple to use, because we want to create queries very easily. Yet it must be powerful enough to express all the requests we might have to a code base.

What might these requests be? Before we dig into the search criteria, let's take a step back and think about what it is we would like to find. Are we interested in source files? Probably not. Java developers don't think in source files. At the bottom level we think in classes, methods and maybe even smaller blocks of code, so these are probably the units of code we want to find. Now what questions might a developer have? She might, for example, want to find methods that have a certain name. The query will look like the following.

METHODS WHERE Name IS "set.*"

This query will return those methods with a name starting with "set". Let's say we are interested in methods that add something to a java.util.Set.

METHODS WHERE CalledMethods CONTAINS {java.util.Set.add}
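To make the Name criterion from the first query a bit more tangible, here is a minimal sketch of how such a pattern could be evaluated against a list of method names. It assumes that the pattern is a Java regular expression that has to match the whole name, which is consistent with "set.*" meaning "starts with set" but is not confirmed anywhere above. The class and method names are made up for illustration and are not part of the prototype.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not the prototype's API.
final class NameCriterion {

    // Keeps only the names that the regular expression matches completely.
    static List<String> methodsMatching(String regex, List<String> methodNames) {
        Pattern pattern = Pattern.compile(regex);
        return methodNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("setName", "getName", "reset", "settle");
        // "reset" contains "set" but does not start with it, so it is filtered out.
        System.out.println(methodsMatching("set.*", names)); // prints [setName, settle]
    }
}
```

Under this assumption, matching is case-sensitive, so "Settings" would not be found by "set.*".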

By combining multiple search criteria and using negation the developer is able to refine the query to get exactly what she needs.

METHODS WHERE ReturnType IS {org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic AND !IsStatic

This query searches for public, non-static methods that should return an instance of IStatusLineManager. With the prefixes "+" and "!" the developer can explicitly mark a criterion as mandatory or prohibited. If we omit the prefix, the criterion is optional and the results we get might not meet it. In this particular example we would want the query to look like this.

METHODS WHERE ReturnType IS {+org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic AND !IsStatic
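The difference between the two versions of this query can be sketched in a few lines of Java. The sketch assumes that "+" means "must match", "!" means "must not match" and no prefix means "should match and only influences ranking", which loosely mirrors the MUST, MUST_NOT and SHOULD clauses of Lucene's boolean queries. Everything here (the class, the string-encoded "facts", the score) is hypothetical and only illustrates the idea; it is not how the prototype is implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: the names below are hypothetical.
final class PrefixSemantics {

    // "+" maps to MUST, "!" to MUST_NOT, no prefix to SHOULD (assumed semantics).
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Criterion {
        final Occur occur;
        final String fact;

        Criterion(Occur occur, String fact) {
            this.occur = occur;
            this.fact = fact;
        }
    }

    // Returns -1 if the method is rejected, otherwise the number of optional
    // criteria it satisfies (a crude stand-in for a relevance score).
    static int score(Set<String> methodFacts, List<Criterion> query) {
        int score = 0;
        for (Criterion c : query) {
            boolean present = methodFacts.contains(c.fact);
            switch (c.occur) {
                case MUST:
                    if (!present) return -1;
                    break;
                case MUST_NOT:
                    if (present) return -1;
                    break;
                case SHOULD:
                    if (present) score++;
                    break;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String returnsSlm = "ReturnType=org.eclipse.jface.action.IStatusLineManager";
        // A public, non-static method that does NOT return IStatusLineManager.
        Set<String> otherMethod = new HashSet<>(Arrays.asList("IsPublic"));

        List<Criterion> optionalReturnType = Arrays.asList(
                new Criterion(Occur.SHOULD, returnsSlm),
                new Criterion(Occur.MUST, "IsPublic"),
                new Criterion(Occur.MUST_NOT, "IsStatic"));
        List<Criterion> mandatoryReturnType = Arrays.asList(
                new Criterion(Occur.MUST, returnsSlm),
                new Criterion(Occur.MUST, "IsPublic"),
                new Criterion(Occur.MUST_NOT, "IsStatic"));

        System.out.println(score(otherMethod, optionalReturnType));  // prints 0: still a hit
        System.out.println(score(otherMethod, mandatoryReturnType)); // prints -1: rejected
    }
}
```

With the optional return type, a method that returns something else is still a (low-ranked) hit; once the return type is marked with "+", that method is rejected.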

This way we can be sure that the methods we find will return IStatusLineManager. What if we - for whatever reason - are interested in public methods that use an IStatusLineManager, are annotated with SuppressWarnings and that should be constructors? Here's the query for that.

METHODS WHERE IsConstructor
AND UsedTypes CONTAINS {+org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic
AND ANNOTATED WITH {+java.lang.SuppressWarnings}

This is really just supposed to be an example of how detailed we can get. There are many more criteria available, and they are not limited to methods. Many of the criteria applicable to methods can be applied to classes too. We might as well search for classes with a certain name.

CLASSES WHERE Name IS "set.*"

With a more complex query we can, for example, search for abstract classes that implement the interface ASTVisitor, preferably using the type java.util.Set.

CLASSES WHERE +IsAbstract
AND ImplementedInterfaces CONTAINS {+org.eclipse.jdt.core.dom.ASTVisitor}
AND UsedTypes CONTAINS {java.util.Set}

Or how about classes that contain deprecated methods?

CLASSES WHERE
CONTAINS METHOD WHERE ANNOTATED WITH {java.lang.Deprecated}

The list goes on. And we don't stop at classes and methods. Sometimes we would like to bring in more context information. For instance a typical question a developer sometimes asks is how other developers handled a certain exception. Did they close the stream afterwards or did others just log the exception?

CATCHBLOCKS WHERE CaughtType IS {+java.io.IOException}

What we can express here is a question many of us ask ourselves over and over again: what have other developers done in my situation?

Repositories
The quality of the results depends mainly on the quality and the volume of the search index. The public index will consist of many open source types (http://eclipse.org/recommenders/documentation/completion/) that we can offer code examples from. A dilemma arises when we think about the precious code that lies dormant in hundreds of types and thousands of methods in the developer's company repository. These sources are most likely not open source and hence can't be put into a public index. The solution is a private search index that is built and stored inside the company infrastructure. The query can then be performed on the public as well as the local search index.
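Querying both indexes could look roughly like the following sketch. It assumes that every index answers a query with a list of scored hits and that scores from different indexes are comparable, which is a simplification that a real implementation would have to justify. The SearchIndex interface, the Hit class and the element names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of running one query against several indexes.
final class FederatedSearch {

    static final class Hit {
        final String index;   // which index the hit came from
        final String element; // the class or method that was found
        final double score;   // relevance, assumed comparable across indexes

        Hit(String index, String element, double score) {
            this.index = index;
            this.element = element;
            this.score = score;
        }
    }

    interface SearchIndex {
        List<Hit> search(String query);
    }

    // Sends the query to every index and merges the hits, best score first.
    static List<Hit> searchAll(String query, List<SearchIndex> indexes) {
        List<Hit> merged = new ArrayList<>();
        for (SearchIndex index : indexes) {
            merged.addAll(index.search(query));
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Stub indexes with canned results; real ones would consult a search index.
        SearchIndex publicIndex =
                q -> Arrays.asList(new Hit("public", "org.example.open.Foo#bar", 0.6));
        SearchIndex privateIndex =
                q -> Arrays.asList(new Hit("private", "com.example.internal.Baz#qux", 0.9));

        String query = "METHODS WHERE ReturnType IS {+org.eclipse.jface.action.IStatusLineManager}";
        for (Hit hit : searchAll(query, Arrays.asList(publicIndex, privateIndex))) {
            System.out.println(hit.index + " " + hit.element);
        }
    }
}
```

The point of the design is that the company code never has to leave the company: only the query travels to the public index, and the merge happens on the developer's side.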

IDE integration
Queries of this kind can then be used easily from inside the Eclipse IDE. The editor assists with things such as the query grammar and the resolving of types. The goal is to make it as efficient as possible to express a query that reflects what the developer is searching for.

While a query of this kind can easily be created by hand, its full potential is still not exploited. In many use cases where the developer would like to find code examples, the query will consist of information that reflects the user's current code context. That might be the interfaces the current class implements, the overridden method we are in and the uninitialized type that we desperately need an instance of. Most of the information such a query consists of is information the IDE could provide. So why shouldn't it? By abstracting further and bundling the search into a few easy-to-understand search types, the query complexity can be hidden underneath plain proposals in the Eclipse content assist popup window. Let's consider the following situation.

IStatusLineManager slm = null;
|<^Space>

We just declared an IStatusLineManager and set it to null. One of the proposals would probably be a search for code that initializes this type in some way; it might find code that, for example, creates an actual instance of IStatusLineManager, or just a method that returns it. Do we now copy the whole method or even the whole class? Or do we just reuse an already existing method that we now know exists? It's up to us, the developers, how to use the code examples that are found.
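Behind such a proposal, the IDE could simply fill in a query template with the type it already knows from the editor. The following sketch shows the idea using the query syntax introduced above; which queries the real content assist proposals would generate is not decided here, so both the helper and the choice of templates are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns editor context into queries in the syntax shown above.
final class ContextQueries {

    // For an uninitialized variable of the given type, propose a search for
    // methods that return the type and one for methods that use it.
    static List<String> queriesForUninitialized(String declaredType) {
        return Arrays.asList(
                "METHODS WHERE ReturnType IS {+" + declaredType + "}",
                "METHODS WHERE UsedTypes CONTAINS {+" + declaredType + "}");
    }

    public static void main(String[] args) {
        String type = "org.eclipse.jface.action.IStatusLineManager";
        for (String query : queriesForUninitialized(type)) {
            System.out.println(query);
        }
    }
}
```

The developer never has to see these strings; she only picks a proposal such as "find code that initializes this type" from the popup.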

How does that sound? Like something that could make your life easier? Let us know.