Jul 5, 2010

Why is Google Codesearch not 'google for code search'?

Whenever we are searching some information in the web, we use Google to find it. Google is actually so popular that we often use ‘google’ as a synonym for searching the web – and sometimes we ask ourselves: “Who did we ask before Google?”

Google works amazingly well when searching the web for articles or documentation etc. However, with the large scale availability of open source code, developers started to search for example code using Google. Although text documents too, applying the same techniques (like word stemming, word splitting etc.) on source code doesn’t work too well.  Therefore in October 2006 Google announced their own search engine specially designed to find source code:  Google Codesearch.  And with Google Codesearch a plenty of other code search engines appeared on the screen like Koders, Krugle, and ByteMyCode - just to name a few.

However, although considered being extremely useful, we had the feeling that code search engines still aren’t used by the majority of software developers. But why? For that reason we started some research on what may hinder the widespread use of such search engines. This post is about some (more or less obvious) findings and a new prototype (re)search engine that aims to solve some of the issues of existing approaches.

Issues with existing code search engines

#1: Usage of plain text-based information retrieval techniques.

The most critical issue with current code search engines (CSEs for short) is that they treat source code still more like text documents than like code. Of course, CSEs adopted their stemming algorithms and phrase detections etc. to better fit source code syntax, but these algorithms still ignore the large amount of structured information that is available in source code. For illustration, consider the code snippet below taken from the Eclipse JavaSnippetEditor. What information can you extract from this piece of source code?

protected void showStatus(String message) {
  IEditorSite site=(IEditorSite)getSite(); 
  site.getActionBarContributor().getActionBars()
      .getStatusLineManager().setMessage(message);
}

Well, basically only the literals used in the snippet. You can extract the names of the methods and variables used and the simple type names given in the method or variable declaration. But how can you determine what kind of object the method getActionBars() is working on? You can’t – at least not from looking at this snippet or its enclosing Java source file only. This information can only be resolved if you have a detailed knowledge about the classpath this piece of code will be executed with and that’s what the compiler does for us: When compiling the source code to binary it resolves all these implicit dependencies using the classpath and thus knows exactly about the types and methods used (this is more or less correct since it only knows about the static types but not the runtime types – but it’s far more knowledge than we get by just looking on the source code of a single file). Without that information we loose most of the structural information in code. To compensate this, CSEs allow us to use regular expressions to get at least some meaning back into our queries. However, regex aren’t a perfect compensation of what we have lost…

This little example reveals the most fundamental problem of current CSEs: They just look in the source code without resolving the implicit (method and type) bindings. For sure, current CSE have to address many other challenges but we think that knowing about your types, methods and even inheritance hierarchies is important to find the best matching code examples. If you are interested to learn how for instance Krugle implemented its search engine I recommend you to get a copy of Lucene In Action Ed. 2. Here Krugle developers wrote a case study chapter about how they implemented their search engine using Apache Lucene and which challenges and design decisions they had to make…

#2: Ignorance of prior knowledge.


A second issue of current CSEs is their ignorance of the developer’s prior knowledge. Consider for instance a developer who is very familiar with SWT and JFace but not familiar with, say, Eclipse Help. A code search engine that presents to him hundreds of examples that use SWT and JFace are less helpful to him than examples that actually use Eclipse Help – these examples carry some new knowledge for him, and thus are much more interesting than code that does not contain any new information.

Consequently, CSEs should leverage the personal knowledge of a developer to build personalized search results. This is, however, difficult when using a web browser to submit and display query results. But it becomes much easier when we integrate this process into the IDE. We will detail on that later.

#3: Almost no IDE integration.

When using CSEs we typically have to leave the IDE and enter our code search query manually into the webpage’s search field. This is disturbing and - in conjunction with the limited query language support (using simple terms and regular expressions for search) - makes it very hard to for developers to specify their query.

Here, approaches are needed that automatically extract the relevant search terms from the code the developer is currently working , automatically sends the query and presents the results within the IDE. Such a tool would greatly ease the use of CSEs.

#4: Scoring of example code is based on words rather than on semantics.


Current scoring of example code works comparable to the scoring of plain text documents. However, as said above, code contains much more structure than text. For instance all used types, all implemented interfaces, all superclasses, overridden methods, called methods etc. Since all existing approaches do not (at least rarely) leverage this structure their current scoring mechanisms are rather simple. But when the structure in code is preserved during search, new, much more complex scoring functions become available.

In the end of this post we will present an approach how CSEs could learn what is actually relevant for a given user or query by looking at how users interact with the search results – and in response to this user feedback updates its scoring function to produce better results next time it is used.

Four issues. How can we solve them?

Let’s start with a simple example how searching for example code from inside your IDE looks like with our current prototype:

Let’s assume we created a new Eclipse View by extending the class ViewPart class as depicted in the figure below.  Let’s assume further that we want to update the view’s status line using an IStatusLineManager whenever we do some background computation. The problem here is how can we obtain a handle on IStatusLineManager (cf. line 18)?




When using a CSE like Google Codesearch or Krugle we have to enter some keywords on the web page’s search field on our own like “IStatusLineManger setMessage” and hope that we get som appropriate code snippets showing how to obtain such a reference. However, when doing so, you get probably something like this:




Which you may or may not find very helpful.


How could a fully integrated code search look like then?

Well, first of all it could create the query fully automatically from the class code itself or just a user selected portion. The figure below highlights various information that could be added automatically to the query when searching for example classes similar to the whole class:



At the end the query contains the information about the overridden methods, the extended classes, the declared fields, used types and methods (IStatusLineManager and IStatusLineManager#setText() for instance).

Next the tool sends this query to the server and displays the results inside the IDE as shown below. The developer now browses the example recommendations and double clicks on those examples he find interesting to look on them in detail. In this case the second example might be very interesting since it contains a method call to IActionBars#getStatusLineManager() which is very likely to contain the information we are actually looking for:

The Code Examples Results view gives a very brief summary of each code snippet. It shows both the types and methods used within the code snippet, and gives some inheritance information like the name of the super-class and implemented interfaces where appropriate.  It serves like an abstract of the class.

As said above, example #2 looks quite interesting and the developer may decide to open the source code by double-clicking on the summary.  Next, a java editor opens showing the source code of the example but also highlights the relevant statements in code and put some markers on the right bar to indicate other relevant places in code:



From here, he may copy the interesting lines into his code. All within ~30 seconds :-)



That’s the prototype UI implementation of our code search engine. To summarize, it’s most important features are:

  1. Automatic query creation from the current editor’s content (whole classes or selected lines)
  2. Sending the query and displaying its results directly inside the IDE
  3. Uses structure preserving queries that uses fully qualified method and type names (requires a special indexing structure on server-side too)
  4. Condensed code summaries
  5. Opens source code with relevant statement highlighting and markers to indicate interesting locations in code.

The UI is, however, just one side of the coin and probably the most challenging part is the backend that is actually performing the search. Although important, I tend think that details on how the backend is actually implemented is not that interesting here. However, there is one important feature I want to discuss: Learning what has been important to continuously improve the search results.

Best results always on top
How to continuously improving the ranking


As we said before, code contains much more structure than actually leveraged by current CSEs. With the availability of such a rich feature set we have to decide which features we consider to be more important than others so that at the end the best examples are ranked on top. Since we take into account different information in code (like the inheritance information, used methods and types etc.) we currently have a set of ~20 different features we use to score a code example. But determining the best weights for each of these features is rather difficult; tweaking one parameter here may affect another parameter there, and thus finally may result in a configuration worse than before.

To overcome this need a machine learning algorithm that learns the optimal parameters/weights for our scoring function. The intuition behind this algorithm is what I want to explain in a nutshell.

Consider that a developer issued a query that returned six results as shown below. Consider further that the user gave some feedback about the quality of these examples by explicitly saying that example #1 and #4 was actually very helpful, example #3 wasn’t helpful, example #6 was somehow helpful and example #2 and #5 weren’t looked at.

How could such a feedback be used to improve the quality of the search engine? Well, we could use this feedback to learn what the user actually looked at and liked or disliked. Therefore we create a partial ranking of the examples that were deemed helpful by the user, look at the features that caused the examples to be ranked that high in the actual ranking and automatically tweak the weights for these features so that the search engine would have produced an optimal ranking where all useful examples where ranked on top and the less useful ones at the bottom.

Or in really simplified: If a user clicked on the second result and ignored the first one (based on a not-interesting summary) we want the next time example #2 to appear before example #1:

This is one of the core features of the backend we implemented: A ranking engine that learns what users actually find helpful and adopts the weights/parameters of the scoring function according to the user’s implicit (or explicit) feedback. This approach could also be used to for personalized preferences and many other things and together with a (purely local) personal knowledge tracker new personalized code search engines become possible. This will be one of our most interesting future work areas :-)

Wrap up


In this post I presented the first prototype of our code search engine, an engine that learns what the user actually liked or disliked and leverages this knowledge to improve its own ranking capabilities over time. Furthermore, we outlined how structure in code could be leveraged to improve code search and how all these features could be seamlessly integrated into the Eclipse IDE.
 
Disclaimer :-)


This (re)search engine is ready to use and available for download (http://recommenders1.st.informatik.tu-darmstadt.de/updates/eclipse3.6/). It currently works with some initially trained weights but will improve itself over time. If you like – give it a try. Its examples database consists of all source code file of the Eclipse 3.6 (RC4) classic edition, and thus might be helpful for Eclipse plug-in developers.

As always, we would love to get your opinions on this tool and whether you feel such a code search engine could serve you better than Google code (or not - which is also a valid finding. Then your arguments would be very interesting :) ). Whatever comments you have – let us know! This engine is build to stay and to be continuously improved – for free. If you like, submit your ideas or issues to our issue tracker at eclipselabs.org and point us to the right search engine for software developers.

It might be that we are wrong with our perception of how good code search engines actually are. Tell us about your opinion by taking this 2 minutes survey: http://recommenders1.st.informatik.tu-darmstadt.de/survey/index.php?sid=91172 This would help much in learning what features you find important and which features you don't like...

For more details on the project itself visit our homepage.


All the best,
Marcel 

Credits

Most of the work in this project is done by volunteers, and same is true for the code search engine. The search engine, its intelligent scoring function and basic UI has been developed by Peter Schroeder as part of his master thesis. The UI is currently improved by Sheip Dargutev and Nikolay Shindov. Thanks for your work!