Build it and they will come? Performant Search Part 2: The Technology Sauce For Better Spaghetti

Our job was to find a long-term, scalable solution to the problem of finding your activities that match your keyword search. This post pertains to the technology involved. Read about product features and new capabilities here.

Turns out, search in RescueTime is a surprisingly complicated problem, for the simple fact that your typical search prioritizes ranked relevance results: it’s OK for the engine to stream results to you, and it’s OK for the search to be incomplete, as long as the ranking is bubbling up the best match. It’s OK for it to be probabilistic or best-guess in nature, sometimes. Generally speaking, you are looking for something small (some words) in something large (a web page).

Our challenge is that while the user experience semantically matches “search”, what you really need is a complete result set, combining all relevant reporting filters, of activities that match your requested search expression. It should produce repeatable results assuming new data hasn’t arrived. It should be real-time updatable as your new data comes in (about every 3 minutes). It should be fine if every record or zero records match; there can be no cap on the number of matches. All this, for about 100-400 logs of time data per user per day for many tens of thousands of users. The longer a user is with us, the larger the list of activities to match against, just for that user. The list of unique activities of our users is well over 1 billion. We should be able to completely rebuild indexes of activities at will, in near real time, to provide application improvements. Yet cost needs to scale at worst linearly as a fraction of the rest of our products’ demands, and speed needs to remain constant or better.

All of these characteristics eventually killed our previous attempts to use industry-standard search models, based first on Lucene/Solr and then on Sphinx. Both held up for a period of time, but both were fundamentally flawed for our purposes because they assume a relatively static, ranked-match, document-based search model. Shoehorning them into our needs required operational spaghetti and some pretty ugly reporting query structures.

Our old search platform may have looked like this and required similar engineer attention, but it didn’t sound as good.

Enter MySQL fulltext Boolean search. First, there is the critical advantage of being *inside* and native to our database platform in a way that even Sphinx with its plugin can’t be. This allows for more integrated and simpler reporting queries: the step of matching your search expression no longer has to be isolated from the reporting query that depends on it (Sphinx could have done this, sort of, with the plugin, but not really the same). Second, in Boolean search mode, MySQL provides an unlimited result set (no cap on results). Additionally, there is less duplication of supporting data, since it can operate entirely in the same instance as the source data; this value is not to be underestimated, for all the inherent caching it leverages. Operationally, it is far easier to dynamically and programmatically add, destroy, and rebuild indexes, since to the operator they behave like tables with normal indexes.

But for performance, the most critical option it offered was a way to fluidly and transparently provide per-user-account search indexes, which lets our performance remain consistent despite constant multi-dimensional growth (new users + accruing existing users’ time data). This isolation-of-index solution would have been theoretically possible in the other options, but horribly unwieldy and in need of huge amounts of supporting operational code. Secondly, it provides a clear way to constrain the size of keyword indexes. In other words, we know from your requested report that you could only possibly care about activities in the particular time range you requested, and this can be of value both in index partitioning options and in the submitted report query itself, especially in the amount of memory that must be held to build final results.

A huge benefit of this known maximum scope for the searchable data is that at any time we can intelligently but programmatically throw away or update whatever dynamic table plus index we had that intersects the requested scope, rather than the entire source tables, and rebuild it in real time for a minor speed penalty (0.1 to 3 sec for your search, versus < 0.01 sec when the index can be reused as is). Any subsequent search request that remains a subset of that most recently persisted scope can just reuse the current index, at the < 0.01 sec speed. We can play with permitted scope expansion to tune for speed. Furthermore, any sharding of accounts across new instances allows the search data to naturally follow, or be rebuilt inline with, the same scaling decision that drove the sharding to begin with: there is no separate stack to worry about following the shard logic.
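To make the reuse-or-rebuild decision concrete, here is a minimal sketch in Python. It is not our production code: the `Scope` class and the `plan_index` function are hypothetical names invented for illustration, and it assumes a scope is a single inclusive date range per account. Widening to the union of the old and new ranges is just one simple "permitted scope expansion" policy.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scope:
    """Inclusive date range covered by one account's search index (hypothetical)."""
    start: date
    end: date

    def covers(self, other: "Scope") -> bool:
        # True when `other` is a subset of this scope.
        return self.start <= other.start and other.end <= self.end

    def union(self, other: "Scope") -> "Scope":
        # Smallest single range containing both scopes (assumed expansion policy).
        return Scope(min(self.start, other.start), max(self.end, other.end))


def plan_index(persisted: Optional[Scope], requested: Scope) -> Tuple[str, Scope]:
    """Decide whether the persisted index can be reused or must be rebuilt."""
    if persisted is None:
        return ("rebuild", requested)
    if persisted.covers(requested):
        return ("reuse", persisted)
    return ("rebuild", persisted.union(requested))
```

A request that stays inside the persisted range takes the fast path; anything else triggers a rebuild over the widened range, after which later requests in that range are fast again.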

Check out some example code for sneaking a search result into a series of joins rather than hanging out in the WHERE clause. Here it can be treated just like any other join, and like any other filter on your pivot reporting.

[sourcecode]
-- check dynamically_maintained_text_and_things_with_controllable_scope
-- if exists and is up to date continue,
-- else, create if needed and
-- push missing scope by intersecting request scope with existing
-- index is maintained by table definition

SELECT * FROM things
INNER JOIN other_things ON things.id = other_things.thing_id
-- begin search acting like join
INNER JOIN (
  SELECT * FROM (
    SELECT things_search_data.thing_id
    FROM dynamically_maintained_text_and_things_with_controllable_scope `things_search_data`
    WHERE MATCH (things_search_data.a_text_field, things_search_data.another_text_field)
          AGAINST ('search words here -there' IN BOOLEAN MODE)
  ) d_things_result_table
) things_result_table
ON things_result_table.thing_id = other_things.thing_id
-- end search acting like join
WHERE
  other_things.limiting_characteristic = a_factor
[/sourcecode]

We’re using the table alias for the search data there because that allows you to chain multiple searches, one after the other (search inside a search), against the same source data by adjusting its table alias.
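As a rough illustration of that chaining, the sketch below generates one search join per expression, each with its own alias. The function name `search_joins` and the numbered alias scheme are assumptions made for this example, not our actual code. The `%s` placeholders follow the parameter style used by common Python MySQL drivers, so the search words are passed separately rather than pasted into the SQL.

```python
from typing import List, Tuple

# Placeholder table name taken from the example query above.
SEARCH_TABLE = "dynamically_maintained_text_and_things_with_controllable_scope"


def search_joins(expressions: List[str], table: str = SEARCH_TABLE) -> Tuple[str, List[str]]:
    """Build one INNER JOIN per boolean-mode expression, each with a unique alias."""
    joins: List[str] = []
    params: List[str] = []
    for n, expression in enumerate(expressions, start=1):
        alias = f"things_search_data_{n}"      # hypothetical alias scheme
        result = f"things_result_table_{n}"
        joins.append(
            "INNER JOIN (\n"
            f"  SELECT {alias}.thing_id\n"
            f"  FROM {table} AS {alias}\n"
            f"  WHERE MATCH ({alias}.a_text_field, {alias}.another_text_field)\n"
            "        AGAINST (%s IN BOOLEAN MODE)\n"
            f") AS {result} ON {result}.thing_id = other_things.thing_id"
        )
        params.append(expression)
    return "\n".join(joins), params
```

Because each search is an inner join, a row survives only if it matches every expression, which is what gives the "search inside a search" effect.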

Engineers looking to provide high-performance, easily maintainable search capabilities should carefully consider MySQL fulltext search options if they are on a recent version of the product. This is especially true if you are trying to manage a search index of billions of tiny items, based on database text columns that mutate very quickly, that would benefit from programmatically created and maintained indexes. For example, it occurs to me that Twitter could use this model in very much the same way we do to provide some powerful realtime search indexes for targeted subsets of accounts or hashtags.

In our case, we were able to reduce our search-related platform costs by about 60-70% and significantly reduce our operational pain, while delivering a solution that provides vastly improved performance and eliminates that “well, it works now but we’ll just re-architect next year” problem.

Our spaghetti search code and virtual spaghetti cabling have now been refactored into something tastier, ready for your consumption.

creative commons credit to Eun Byeol on flickr


Build it and they will come? Performant Search brings Flexible Reports Part 1: Key Word Filtering works!

Our job was to find a long-term, scalable solution to the problem of Searchable Time. This post discusses our search capability and some ways to use it, now that we have reliable and speedy access to this feature. There will be a follow-up post presenting the technology chosen, for those interested.

RescueTime has three features that depend on what we are calling “search”. I will be presenting two of them here: using keywords and expressions as a reporting filter with the “Search” field, and the Custom Report module (the third is “hints” in the projects time entry interface).

I’ve been putting “search” in quotes (though I’ll stop that affectation now) because what we’re doing here is a bit different than a traditional Google-style search. We’re giving you a way to see a view of your RescueTime history across any span of time you choose, pivoted on your perspective of interest, e.g. Categories or Activity Details or Productivity, for any activity we find that matches your search request. A “Custom Report” is just a way to save a search query for repeated use. But what does this all mean?

If you take a moment and think about it, this filtering can be very powerful. If you pick a good set of keywords, possibly with some tweaking using logical expressions (more on that later), you can get a fascinating view across your history, regardless of category, productivity, or other classification, that is focused in high resolution on a particular project, client, or other meme that might appear in many different applications and websites. How much time did you spend dealing with “John”? Or, what is my pattern of time spent in a console versus my text editor (“terminal iterm aquamacs sublime vim”)?

Consider your document names, or folder names, email addresses, chat identities, and websites as potential members of a search expression to build these reports. The search engine will also understand logical AND and NOT and nesting. The default relationship between words is OR.

Let’s consider another example: How much did the last mini-release cost us?

You’ve got a team working on a project codenamed “Piranha”. This name appears in code filenames and directories, or Eclipse project names. It appears, with a little discipline, in your email subjects. And your support ticketing and requirements tracking system. And your marketing materials’ files and web pages. And your internal chat group. And your meetings entered via offline time tracker. You get the idea: we can give a total time cost of this project, with 0 (zero) data entry across your entire organization. Well, plus any time your team spent learning about piranhas on Wikipedia (pick smart project names for best results, and use logical operators to help out, e.g. “piranha NOT wikipedia NOT vimeo”). You can then save this as a Custom Report for ongoing metrics, and side by side comparison with other ongoing custom reports.
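For the curious, here is one way an expression like that could be turned into MySQL Boolean-mode operators, where `+` means a word is required and `-` means it is excluded. This is a simplified sketch, not our actual parser: the function name `to_boolean_mode` is made up for illustration, and it assumes single words only, with no nesting or quoted phrases.

```python
def to_boolean_mode(expression: str) -> str:
    """Translate 'a b NOT c' or 'a AND b' into MySQL boolean-mode syntax.

    Assumed simplification: plain words separated by spaces, with optional
    AND / OR / NOT keywords between them. Bare words stay optional (OR).
    """
    terms = []
    prefix = ""
    for token in expression.split():
        keyword = token.upper()
        if keyword == "NOT":
            prefix = "-"          # exclude the next word
        elif keyword == "AND":
            # Require the previous word (if not already marked) and the next one.
            if terms and not terms[-1].startswith(("+", "-")):
                terms[-1] = "+" + terms[-1]
            prefix = "+"
        elif keyword == "OR":
            prefix = ""           # OR is the default relationship
        else:
            terms.append(prefix + token)
            prefix = ""
    return " ".join(terms)
```

With this sketch, `to_boolean_mode("piranha NOT wikipedia NOT vimeo")` gives `piranha -wikipedia -vimeo`, which matches activities mentioning piranha while dropping the Wikipedia and Vimeo detours.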

Thank you to all our customers for sticking with us and giving feedback during the iteration of this slightly magical tool. We think search is finally fully operational.


How’s your Friday? Can you beat this?

I couldn’t resist a quick post noting The Onion’s entrée into at-the-desk analytics…

Office Cheering On Employee Going For 32-Minute Nonstop Work Streak

Unfortunately, since he wasn’t running RescueTime he couldn’t prove it or share it with posterity. How many of you have bested this stellar performance today?


RescueTime for ChromeOS: Getting Productivity in the Cloud

For individual or team users, RescueTime for ChromeOS brings automated productivity management and app use statistics to Chromebook and Chromebox users. Get it here: RescueTime for ChromeOS, and get productive today.

Google is very much on to something with ChromeOS. This system paradigm is pretty much inevitable, in some flavor, for students, schools, and office desks. While power users may scoff at its limitations (I used to be a scoffer), the available features are constantly pushing forward, and the benefits of this totally reconsidered platform are clear.

Data always backed up to the cloud. Everything encrypted. Inoculated against hardware hacking. Remotely wipeable. Centralized app asset management, user management, etc. Google really leap-frogged Microsoft on this; Microsoft should have seen it coming. It’s all those features people expect a mobile device to have, but done better, and for the desktop.

All these features make the platform ideal for offices or schools needing a low total cost operating system, especially if just about everything the user needs to do is already on a network-based service. And for those applications that aren’t, the Chrome extension development APIs make porting access over a low-cost effort.

Which brings us to RescueTime for ChromeOS. When working on our plugin update, we realized that, with relatively little extra work, RescueTime could deliver a full ChromeOS app for online and offline use.

Schools and businesses using the team version get the benefit of adoption and user engagement measurements– including top used apps, and most active groups in their organizations.

Just like the regular desktop app, RescueTime for ChromeOS will give the same detailed but automatic time statistics, seamlessly integrated with any other device running RescueTime, including your Android phone or tablet. Google users running RescueTime are now able to get 100% coverage of their digital life for distraction and productivity management.


Introducing time management in real time: RescueTime for Chrome and Firefox

Keeping track of how you are spending your time is great, but if it takes too long to track down your productivity stats, it can become a distraction on its own. We wanted a way to effortlessly record how we are spending our days in a way that we could easily access with a single click. So, we built RescueTime for Chrome and Firefox.

This new browser extension brings our full time-management app into your browser, and gives you a more “live” interaction with your activities as they unfold – feeding you current stats about the site you have open, as well as information about your entire day.

Highlights:

  • Easily see how much total time you’ve spent online today
  • Compare that time to your average over the past two weeks
  • Understand how productive your time has been
  • See how long you’ve been on the current website (or other sites similar to the current website)
  • Works with a new or existing RescueTime account

 

You can use the same account to track time across multiple browsers (on multiple computers, even). If you’ve got the RescueTime desktop application installed, all of your computer time will be reflected in the graphs. Just make sure you have the “I already have RescueTime on this computer” setting checked in the browser plugin.

We think it’s a great addition to the RescueTime family. We hope you’ll get as much out of it as we have.

Get RescueTime for your Web Browser today.


Import your Wakoopa data to RescueTime

Hi folks,

We’ve just released a tool that lets you send your Wakoopa data export to RescueTime for processing into your account.

Wakoopa users: sign up with 30% off and import your Wakoopa history.

The uploader is available from the “product” section of the “account” screen of settings on our site. Here is a direct link: Import time logs from other services [here]. You can upload the plain export file, or you can compress it with gzip or zip first. Here is Wakoopa’s blog post concerning their timeline: http://blog.wakoopa.com/#timeline. Consult the email they sent you for instructions on getting your export. If we can get a copy of these instructions we will add the info here as well.

There are 4 steps to import:

  1. Make a discounted RescueTime account (2 weeks before any charge, downgrade anytime)
  2. Upload your import file
  3. Convert the data
  4. Import the time to your RescueTime account

If there’s any problem, contact us!

 

Import Tool


A New Report and Search / Key Word Filtering Improvements

tl;dr: We’ve added an “Activity Details” report that presents your normal graph views of rank, over time, and productivity of all your “detailed” activities, or documents. This view is particularly useful combined with the Search word filtering tool, which now has improved results matching.

Search Improvements: Foundation for Our New Report

Our previously announced search improvements were primarily targeted at dramatic speed gains and increased reliability. As this new infrastructure stabilized, we took a careful look at how key word filtering was working for users, and considered the great feedback our users provided in conjunction with our own analysis. Then, over the last few weeks, we’ve been tuning how we can better and more intuitively match against users’ requested search parameters. For example, if you are in Word and have a long file path for the document you are editing, you should be able to search by directory, filename, file type, application type, etc. We’ve now arrived at what seems like an effective general solution for smart indexing, but we will continue to examine the results and take feedback on how it can be further improved. Altogether, the much improved speed combined with the more accurate results gave us the opportunity to integrate a new report into our offering, one that makes keyword reports particularly useful for exploring how you spend your life on the computer.

A side note about terms: I use both the terms “search” and “key word filtering” due to the dual purposes of this capability. We find that users predominantly use this feature (and its persisted cousin, Custom Reports) for the purpose of generating reports filtered against desired match results. We call this use “key word filtering” because rather than trying to find something, you are trying to generate a filtered report with a sort of ad hoc grouping. However, users also sometimes use this feature simply to aggregate or locate time for a specific item: this is the “search” use. Finally, there is a semantic case to be made that, in general, web app users are more inclined to understand at first glance what goes in this field when it is labeled “Search”, despite that not really being its primary purpose.

A New Report: The Activity Details Report for Premium Users

A much-asked-for feature, our new Activity Details report provides an immediate view into the time you spend on your most urgent items, no matter what application or site you are on. If you are tracking your time in a project or for a customer, or want to understand, for example, how email time figures against your favorite design or engineering tool, this is a great resource. You can filter it with keywords to narrow down the view, and you’ll get reports that graph the top documents or pages, and a table that lets you see your app/site plus its documents. Critically, it was previously impossible to see all the results of search filters in one view: you could never see matching documents/details from different apps and sites expanded together, and now you can. The old activity report is now called the Activity Summary report, if you want a less noisy summary.

Navigation Changes: Integrating Search / Key Word Filtering into Regular Use

In conjunction with the above changes, we’ve rationalized how the site navigation responds to your searches and key word filtering. Again, this is an attempt to combine our own analysis with your feedback, and may be tweaked over time.

  • From any report view, a new search (as in, clicking the search button) lands you on the Activity Details page. This provides you with immediate feedback on the quality of your search results. Non-premium users still land on the Activity Summary page.
  • Clicking search on the dashboard leaves you on the dashboard, with filtered results
  • Once a search is active, it becomes sticky: if you navigate to Time reports using the side navigation, or click the “view complete report” links on dashboard widgets, your current search filter is preserved
  • When viewing Activity reports, the search filter is preserved for Application or Site items linked in the table results. For example, if you search for keywords like “Seattle Atlanta”, you get a list of all apps and sites that have either of those words in their name or document details; if GMail was in the results, and you clicked the item “GMail” anywhere it occurs in the app/site column, you would get a report of all GMail items with the same keywords in its subjects and details.
  • To stop a search filter from affecting your Time Report browsing, simply click “Clear Search Filter [x]”: your screen is reloaded without the filter, and the filter is removed from all navigation points.
  • Note: at this time, links clicked *inside* the graphs themselves do not preserve the keywords. We’re continuing to explore sensible behavior for this case.

Thanks for your patience and feedback as we improve RescueTime!