Dec 2, 2011

How should code search work?

Hey all. As you probably all know, this is the Code Recommenders project blog we typically use to announce new features, releases or other noteworthy topics related to Code Recommenders. This blog post is a bit different than the others: It's our first "guest blog post" written by Tobias Boehm, a master student doing his master thesis in the scope of the Code Recommenders project.
This post is basically a brain dump how he thinks code search engines like Google Code Search, Krugle, or Koders *should* work and how we should be able to use them. He introduces an early prototype of a code search query language (which he will implement using Xtext) and a client tightly integrate into the Eclipse IDE. His work will be based on the previous code search engine and Eclipse client we already blogged about a few months ago here: "Why is Google Codesearch not google for code search?"


This blog post is a "heads-up! Your feedback is wanted!" post. So, please do not hesitate to ask tough questions or provide any other kind of feedback. All kind of feedback is appreciated and will help Tobias to catch idea bugs early! If you are interested in joining the work on code search engines, get in contact via the Code Recommenders forum. We are looking forward to your feedback. 


Thanks,
Marcel



How should code search work?
We all know that learning an API is hard. But we do it day by day by day...  When learning a new framework we often ask things like "Which classes should I extend?", "Which methods should I invoke on this object?", "How can I create an instance of this particular type?", "How does the code of others look like that is similar to mine?". Hopefully, some documentation is available that answers these kinds of questions. But how many frameworks do you know that have such excellent documentation? If you know a few: How many do you know which have not?

Let's assume that there is some documentation somewhere. Who wants to work through heaps of documents just to know how obtain an instance of the famous IStatusLineManager? In my opinion, the best possible documentation already exists. It's existing and tested code which is available in masses in code repositories where these API are actually used, the classes are extended, the methods called and the objects actually instantiated. The code is there, it just has to be found!

But searching for source code is still a tedious task. Although many search engines exists and even Google has a product targeted at code search - they are not too useful in certain situations. First it seems most of them treat source code the way web search engines treat websites they index - as plain text. While that might make sense for some code search use cases - it is not enough for most others. Moreover they are too generic to be useful. Most code search engines seem to identify the programming language the code is written in, yet they are not using the language-specific semantic that lies underneath.

And then there is "availability". For a developer to be able to search for source code this source code has to be indexed by the search engine. So the code available is always restricted to open source code publicly available in the web with all personal and company repositories being unused.

Lastly, there is no IDE integration. What that means is that every time we want to issue a query we are shifting focus away from our IDE to a website.

In this blog post, I describe may plans on how to implement a Code Search engine with Apache Lucene. I'll go through a set of sample queries and explain what get's indexed and how developers can query the index to solve common day-to-day tasks.

If you are interested in code search and are maybe seeking for good alternative to (almost closed) Google Code Search or want to build your own code search engine for your own company - please continue reading and don't hesitate to ask questions about it here or in the Code Recommenders forum.


Query
As said above, search query capabilities of todays code search engines are somewhat limited. Code Recommenders might come to rescue here. The heart of this prototype currently in development is a novel query language. This query language must be very simple to use. We want to create queries very easily. Yet it must be so powerful that we can express all the requests we might have to a code base. What might these requests be? Before we dig into the search criteria let's take a step back and think about what it is we would like to find. Are we interested in source files? Probably not. Java developers don't think in source files. At the bottom level we think in classes, methods and maybe even smaller blocks of code. Then these are probably the units of code we want to find. Now what questions might a developer have? She might for example want to find methods that have a certain name. The query will look like the following.

METHODS WHERE Name IS "set.*"

This query will return those methods with a name starting with "set". Let's say we are interested in methods that add something to a java.util.Set.

METHODS WHERE CalledMethods CONTAINS {java.util.Set.add}

By combining multiple search criteria and using negation the developer is able to refine the query to get exactly what she needs.

METHODS WHERE ReturnType IS {org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic AND !IsStatic

A Query that will search for public, non-static methods that should return an instance of IStatusLineManager. When using the prefix "+" and "!" the developer can explicitly mark criteria as mandatory or non-occurring. If we omit the prefix the condition is optional and the results we get might not meet the criteria. In this particular example we would want the query to look like this.

METHODS WHERE ReturnType IS {+org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic AND !IsStatic

This way we can be sure that the methods we find will return IStatusLineManager. What if we - for whatever reason - are interested in public methods that use an IStatusLineManager, are annotated with SuppressWarnings and that should be constructors? Here's the query for that.

METHODS WHERE IsConstructor
AND UsedTypes CONTAINS {+org.eclipse.jface.action.IStatusLineManager}
AND +IsPublic
AND ANNOTATED WITH {+java.lang.SuppressWarnings}

This is really just supposed to be an example of how detailed we can get. There are many more criteria available and they are not bound to just methods. Many of the criteria applicable to methods can be applied to classes too. We might as well search for classes with a certain name.

CLASSES WHERE Name IS "set.*"

With a more complex query for example we search for abstract classes that implement the interface ASTVisitor preferably using the type java.util.Set.

CLASSES WHERE +IsAbstract
AND ImplementedInterfaces CONTAINS {+org.eclipse.jdt.core.dom.ASTVisitor}
AND UsedTypes CONTAINS {java.util.Set}

Or how about classes that contain deprecated methods?

CLASSES WHERE
CONTAINS METHOD WERE ANNOTATED WITH {java.lang.Deprecated}

The list goes on. And we don't stop at classes and methods. Sometimes we would like to bring in more context information. For instance a typical question a developer sometimes asks is how other developers handled a certain exception. Did they close the stream afterwards or did others just log the exception?

CATCHBLOCKS WHERE CaughtType IS {+java.io.IOException}

What we can express here is a question many of us ask themselves over and over again. What have other developers done in my situation?

Repositories
The quality of the results is dependent mainly on the quality and the volume of the search index. The public index will consist of many open source types (http://eclipse.org/recommenders/documentation/completion/) that we can offer code examples from. A dilemma arises when we think about precious code that lies dormant in hundreds of types and thousands of methods in the developer's company repository. These sources are most likely not open source and hence can't be put into a public index. The solution is a private search index that is build and stored inside the company infrastructure. The query can then be performed on the public as well as the local search index.

IDE integration
Queries of this kind can then already be used easily from inside the Eclipse IDE. The editor assists with things such as the query grammar and resolving of types. The goal is to make it as efficient as possible to express a query that reflects what the developer searches for. While a query of this kind can easily be created by hand its full potential is still not exploited. In many use cases where the developer would like to find code examples, the query will consist of information that reflect the user's current code context. That might be the interfaces the current class implements, the overridden method we are in and the uninitialized type that we desperately need an instance for. Most of this information the query consists of are usually ones that the IDE could provide. So why shouldn't it? By abstracting further and bundling the search into a few easy to understand search types the query complexity is now hidden underneath plain proposals in the Eclipse content assist popup window. Let's consider the following situation.

IStatusLineManager slm = null;
|<^Space>

We just declared an IStatusLineManager and set it to null. One of the proposal would probably be a search for code that initializes this type in some way and might find code that for example creates an actual instance of IStatusLineManager or just a method that returns it. Do we now copy the whole method or even the whole class? Or do we just reuse an already existing method that we now knows exists? It's up to us the developer how to use the code examples that are found.

How does that sound. Like something that could make your life easier? Let us know.

18 comments:

  1. Syntax is SO ugly

    ReplyDelete
  2. This sounds awful.

    Just one more way for bad programmers to cut and paste their way to being out of a job in six months.

    ReplyDelete
  3. Hmm. I like your idea, or at least your problem statement, but I don't think we need another query syntax to learn. Maybe bring in some Amazon like recommendations as I edit the code: "developers who have started with Object[] often refactor to List<Object>...".

    ReplyDelete
  4. I think one of the problems with source code search is that there's a TON of example code that's not in any repo -- it's on tutorial or question/answer sites like StackOverflow. Often these are the most useful ways to find information about unknown code, but they're tough to search for because most web search engines strip special characters.

    The only search engine I'm aware of that doesn't strip special characters is SymbolHound: http://symbolhound.com/

    full disclosure: I am a co-founder and developer of SymbolHound. I just launched it a few weeks ago. Currently you may notice that StackOverflow is overrepresented in search results, but that should change as the index size grows.

    ReplyDelete
  5. I feel this is a nice idea. Yes, the syntax could be better as one of the commenter has mentioned above.

    Unlike the other commenter it doesn't bother me that some people can use it unethically. No matter what you invent, there is always a good chance that some people are going to used it in wrong ways. That shouldn't prevent us from inventing something that is going to be good and useful for many others.

    ReplyDelete
  6. Thanks for the comments so far.

    @Anonymous "ugly syntax". Agreed, there is some room for improvement. But the query language is just a front end to the search engine. What matters is what can your search for. The language (graphical or textual) is second to this.

    @John Haugeland I go with susam here. Yes, one can misuse these tools. You never search source code? Not even with grep? :)

    @David Crane, yes there is a lot of code not in repositories. But there is also a lot of code in them. Eclipse Marketplace has 20GB, Maven central 120GB, sourceforge.net, google code, bitbucket, github... well, it should be good enough for most queries, right? There is also an option to run such tools in-house on your own code base... However, standard websearch is also an interesting approach. Good luck with symbolhound.com

    @susam (and all others that commented so far): Thanks. We'll improve the query language based on your thoughts.

    ReplyDelete
  7. Really interresting, but I think the query language is quite unusable by itself. For simple cases, eclise already does better (find Types, Type hierarchy, call tree). For complex cases, who want to program and debug a complex query for an hour... Just to find the right code to write? Chances are googling for a tutorial is going to be faster.

    But if it serve as the basis to make IDE more "intelligent" this is good.

    Second interresting point and maybe consequence for me is the separation of the IDE and the program code base / index / query engine.


    On huge code base, today querying is slow, even the "basic" eclipse features. Compiling take hours.

    Doing this on every every IDE instance for every computer is good up to some code size.

    To manage that we make module, use jar and maven, but ultimately we loose the possibility to properly query/debug the whole code base, introducing more integration errors... because we miss the big picture.

    But with an external server/farm compiling and indexing the code offline, doing it for several version of the code and connected to an IDE would provide a very rich, interresting - and fast - environement to use for development.

    ReplyDelete
  8. I had a similar idea for a command line tool, a "semantic grep" if you will. Querying arbitrary programming languages by converting its AST to a generic one that is queryable. I started an initial prototype in Haskell, but it only consists of parsers and AST converters for C and Python so far. No query language yet: https://github.com/jb55/semgrep

    Ideally something like this could be used as an autocompletion tool/code search for arbitrary programming languages. That's the goal anyway.

    ReplyDelete
  9. Have you seen jexamples.com (http://jexamples.com). This site provides a semantically correct search engine that mines open source projects for Java source code examples.

    ReplyDelete
  10. Anonymous, to summarize: the language is worse but you like the idea to have your own code search engine for in-house projects (integrated into your IDE?

    Bill, thats a good goal. Language independence is the way to go - finally. But at the moment I tend to believe we could strive to make a good search engine per language and then generalize :) Anyway, good luck with you project!

    Chris, this post is not so much about existing code search engines. It's about the query language and the integration of code search engines into the IDE. Jexamples QL is simple ( http://jexamples.com/queryhelp.html) I wonder if we could do more? On the long run, however, it would be great to integrate existing engines like JExample into the IDE - maybe with a common query language.

    To summarize the feedback we received so far here, on reddit, twitter and gplus:

    This QL is not good enough. It seems that people don't like SQL like languages for code search in general.

    Ideas go from simple QLs we know from websearch like

    "calls: method1(), Type.method2 extends:ClassB"

    to lambda-like QLs:

    "
    Maybe prototype a syntax focused towards lambdas?
    In C#/Java it would look like: var privateSetters = methods.Where(m => m.Name == "Set" && m.Visibility == "Private");
    "

    to code template-like query languages:

    Query 3:
    public static IStatusLineManager *(*){
    *
    }

    Query 4:
    @SuppressWarnings(*)
    public *(*){
    USES IStatusLineManager
    }

    * stands for anything

    The more I think about it, I tend to believe that there will be more than one front-end to the search engine. I think, it makes sense to define an exhaustive set of arbitrary complex search queries in natural language first, then specify the index data structure next to answer these queries, and *then* specify a set of QLs that can be used interchangeably to query the index.

    Thanks a lot for your comments, so far. It's an interesting topic with many opinions and a challenging research area (see http://resuite.org/suite/2012 that deals with such kind topics).

    ReplyDelete
  11. You have done great work on Google search code work.This really very benefited to me.Just saying thanks will not have of good clarity on your thoughts

    ReplyDelete
  12. Nice idea. As a masters student a while ago, I worked on a project called jQuery (not THE jquery, another jQuery that predates the current one http://jquery.cs.ubc.ca/). Anyway, this project treated your eclipse workspace as database that you could query over and provided a nice UI to help view the queries.

    I like your approach that you are expanding the database to include source code from everywhere.

    Although I agree that your query language isn't beautiful, I'd recommend that you not spend too long making it perfect. As you gain more experience with the language, you'll be able to learn more about the kinds of queries that people like to do. And rather than force users to create a query from scratch, you can provide some nice UI to help them build the kind of query that they want (while hiding the actual syntax).

    We found this to be a reasonable approach with jQuery. Good luck.

    ps- the project is no longer maintained and I don't know if it still works in recent versions of Eclipse.

    ReplyDelete
  13. Thanks, Andrew. Tobias is making good progress on indexing a local workspace. I think he'll have a version that offers a simple Lucene query front-end ready in January. the exact QL (better several QLs) comes next.

    Deepak is also big fan of visual QL. We'll find out ;) I'll look at JQuery's UI to get some good ideas. Thanks for the pointer.

    ReplyDelete
  14. Something popped to my mind when I read this: Refactoring.

    When wading through legacy code, I can find tons of "anti patterns" which badly need fixing. Despite the fact that these patterns repeat, there is no refactoring for them. Your code query would be one building block for a tool to do code search&*replace*.

    As for syntax: Try to avoid using any available symbol on the keyboard.

    What I'd like to see:

    METHODS WHERE Name LIKE "set*"
    METHODS WHERE Name LIKE /set.*/

    METHODS WHERE CalledMethods CONTAINS ( ... )

    METHODS WHERE ReturnType = org.eclipse.jface.action.IStatusLineManager
    AND IsPublic AND NOT IsStatic

    (or AND !IsStatic)

    I don't see what you gain by the + prefix unless there is also a - prefix. "!" isn't the opposite of "+".

    AND ANNOTATED WITH ( java.lang.SuppressWarnings )

    (braces should be optional if there is just one value). What did the "+" do here???

    From the Java grammar, it's obvious that annotations and interfaces are sets. No point to expose this again in the QL.

    Repositories should allow cascading. A usual setup will be "project, company index, public index"

    ReplyDelete
  15. Excellent thoughts Aaron. Thanks for your time spent on digging into this. At http://wiki.eclipse.org/Recommenders/CodeSearch we currently assemble a list of "search questions" a human might ask Siri (if Siri were a code search engine). Feel invited to put all kinds of queries that come into you mind there.
    There is no need for concrete syntax. Tobias will use these queries in January to test the power of his search index and test the expressiveness of all QLs he (or we!) might come up with. Consider it as the beginning of a benchmark for code search query languages...

    Thanks,
    Marcel

    ReplyDelete
  16. As i go through your post, I found this blog to be excellent. Great write-up. Keep it up..

    ReplyDelete
  17. The blog is absolutely fantastic. Lots of great information and inspiration, both of which we all need. Thanks.

    ReplyDelete
  18. It's an interesting topic with many opinions and a challenging research area.
    For More Information

    ReplyDelete