Build it and they will come? Performant Search Part 2: The Technology Sauce For Better Spaghetti

Our job was to find a long-term, scalable solution to the problem of finding the activities that match your keyword search. This post covers the technology involved. Read about the product features and new capabilities here.

It turns out that search in RescueTime is a surprisingly complicated problem, for the simple reason that typical search prioritizes ranked relevance: it's ok for the engine to stream results to you, and it's ok for the search to be incomplete, as long as the ranking bubbles up the best matches. It's sometimes even ok for it to be probabilistic or best-guess in nature. Generally speaking, you are looking for something small (some words) in something large (a web page).

Our challenge is that while the user experience semantically matches "search", what you really need is a complete result set, combining all relevant reporting filters, of activities that match your requested search expression. It should produce repeatable results, assuming no new data has arrived. It should be updatable in real time as your new data comes in (roughly every 3 minutes). It should be fine if every record or zero records match; there can be no cap on the number of matches. All this, for about 100-400 logs of time data per user per day across many tens of thousands of users. The longer a user is with us, the larger the list of activities to match against, just for that user. The list of unique activities across our users is well over 1 billion. We should be able to completely rebuild indexes of activities at will, in near real time, to support application improvements. Yet cost needs to scale at worst linearly as a fraction of the rest of our products' demands, and speed needs to remain constant or better.

All of these characteristics eventually killed our previous attempts to use industry-standard search models, based first on Lucene/Solr and later on Sphinx. Both held up for a period of time, but both were fundamentally flawed for our purposes in the assumptions they make: they expect a relatively static, ranked-match, document-based search model. Shoehorning them into our needs required operational spaghetti and some pretty ugly reporting query structures.

Our old search platform may have looked like this and required similar engineer attention, but it didn’t sound as good.

Enter MySQL fulltext Boolean search. First, there is the critical advantage of being *inside* and native to our database platform in a way that even Sphinx with its plugin can't match. This allows for more integrated and simpler reporting queries: the step of matching your search expression no longer has to be isolated from the reporting query that depends on it (Sphinx could approximate this with its plugin, but it's not really the same). Second, in Boolean search mode, MySQL returns an unlimited result set (no cap on results). Additionally, there is less duplication of supporting data, since it can operate entirely in the same instance as the source data; this value is not to be underestimated, given all the inherent caching it leverages. Operationally, it is far easier to dynamically and programmatically add, destroy, and rebuild indexes, since to the operator they behave like tables with normal indexes.
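To make that concrete, here is a minimal sketch of what this looks like in practice. The table and column names here are hypothetical illustrations, not our actual schema:

[sourcecode]
-- FULLTEXT indexes are declared like any other index, so creating,
-- dropping, and rebuilding them is ordinary DDL.
-- (MyISAM shown here; InnoDB gained FULLTEXT support in MySQL 5.6.)
CREATE TABLE activities_search (
  activity_id INT UNSIGNED NOT NULL,
  activity_name TEXT NOT NULL,
  PRIMARY KEY (activity_id),
  FULLTEXT KEY ft_name (activity_name)
) ENGINE=MyISAM;

-- Boolean mode: no relevance ranking needed and no cap on the result
-- set. + requires a word, - excludes one.
SELECT activity_id
FROM activities_search
WHERE MATCH (activity_name)
AGAINST ('+spreadsheet -games' IN BOOLEAN MODE);
[/sourcecode]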

But for performance, the most critical option it offered was a way to fluidly and transparently provide per-user-account search indexes, which lets our performance remain consistent despite constant multi-dimensional growth (new users plus accruing existing users' time data). This isolation-of-index solution would have been theoretically possible with the other options, but horribly unwieldy and in need of huge amounts of operational supporting code. Second, it provides a clear way to constrain the size of keyword indexes: we know from your requested report that you could only possibly care about activities that fall within the particular time range you requested, and this is valuable both in index partitioning options and in the submitted report query itself, especially in the amount of memory that must be held to build final results. A huge benefit of this known maximum scope for the searchable data is that at any time we can intelligently but programmatically throw away or update whatever dynamic table plus index intersects the requested scope, rather than the entire source tables, and rebuild it in real time, for a minor speed penalty (< 0.01 sec vs. 0.1 to 3 sec for your search). Any subsequent search request that remains a subset of the most recently persisted scope can simply reuse the current index, at the < 0.01 sec speed. We can play with permitted scope expansion to tune for speed. Furthermore, any sharding of accounts across new instances allows the search data to naturally follow, or be rebuilt in line with, the same scaling decision that drove the sharding to begin with: there is no separate stack that has to track the shard logic.
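As a rough sketch of how one of those scope-limited, per-account tables might be created and maintained (every name and the scope columns here are hypothetical simplifications, not our production code):

[sourcecode]
-- Hypothetical per-account, scope-limited search table. The FULLTEXT
-- index is maintained by the table definition itself.
CREATE TABLE IF NOT EXISTS things_search_account_42 (
  thing_id INT UNSIGNED NOT NULL,
  a_text_field TEXT NOT NULL,
  another_text_field TEXT NOT NULL,
  PRIMARY KEY (thing_id),
  FULLTEXT KEY ft_fields (a_text_field, another_text_field)
) ENGINE=MyISAM;

-- Push only the missing scope: intersect the requested report range
-- with what the table already covers and insert the difference.
INSERT INTO things_search_account_42 (thing_id, a_text_field, another_text_field)
SELECT id, a_text_field, another_text_field
FROM things
WHERE account_id = 42
  AND logged_at >= '2012-08-01'  -- start of the missing scope
  AND logged_at <  '2012-09-01'; -- end of the missing scope

-- If the persisted scope goes stale, throw the whole thing away and
-- rebuild it in real time for a minor one-time penalty.
DROP TABLE IF EXISTS things_search_account_42;
[/sourcecode]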

Check out some example code for sneaking a search result into a series of joins rather than hanging out in the WHERE clause. Here it can be treated just like any other join, and like any other filter on your pivot reporting.

[sourcecode]
-- check dynamically_maintained_text_and_things_with_controllable_scope
-- if it exists and is up to date, continue;
-- else, create it if needed and
-- push the missing scope by intersecting the requested scope with the existing one
-- (the fulltext index is maintained by the table definition)

SELECT * FROM things
INNER JOIN other_things ON things.id = other_things.thing_id
-- begin search acting like a join
INNER JOIN (
  SELECT * FROM ( SELECT thing_id
    FROM dynamically_maintained_text_and_things_with_controllable_scope AS things_search_data
    WHERE MATCH (things_search_data.a_text_field, things_search_data.another_text_field)
    AGAINST ('search words here -there' IN BOOLEAN MODE)
  ) d_things_result_table
) things_result_table
ON things_result_table.thing_id = other_things.thing_id
-- end search acting like a join
WHERE
  other_things.limiting_characteristic = a_factor
[/sourcecode]

We're using the table alias for the search data there because it allows you to chain multiple searches, one after the other (a search inside a search), against the same source data by adjusting the table alias for each.
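For instance, a chained search might look something like this sketch (same hypothetical tables as above), with each search getting its own alias and its own derived result table:

[sourcecode]
-- Hypothetical sketch: two searches chained against the same source
-- data, each with its own alias, so the second search filters the
-- results of the first.
SELECT things.*
FROM things
INNER JOIN (
  SELECT thing_id
  FROM dynamically_maintained_text_and_things_with_controllable_scope AS first_search_data
  WHERE MATCH (first_search_data.a_text_field, first_search_data.another_text_field)
  AGAINST ('first expression' IN BOOLEAN MODE)
) first_result ON first_result.thing_id = things.id
INNER JOIN (
  SELECT thing_id
  FROM dynamically_maintained_text_and_things_with_controllable_scope AS second_search_data
  WHERE MATCH (second_search_data.a_text_field, second_search_data.another_text_field)
  AGAINST ('second expression' IN BOOLEAN MODE)
) second_result ON second_result.thing_id = things.id;
[/sourcecode]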

Engineers looking to provide high-performance, easily maintainable search capabilities should carefully consider MySQL's fulltext search options if they are on a recent version of the product. This is especially true if you are trying to manage a search index of billions of tiny items based on database text columns that mutate very quickly, and would benefit from programmatically created and maintained indexes. For example, it occurs to me that Twitter could use this model in much the same way we do to provide powerful real-time search indexes for targeted subsets of accounts or hashtags.

In our case, we were able to reduce our search-related platform costs by about 60-70% and significantly reduce our operational pain, all while delivering vastly improved performance and eliminating that "well, it works now, but we'll just re-architect next year" problem.

Our spaghetti search code and virtual spaghetti cabling has now been re-factored into something tastier, ready for your consumption.

Creative Commons credit to Eun Byeol on Flickr


Obama racks up karma and blows up servers with his Reddit AMA

President Obama made an appearance on Reddit last Wednesday with an AMA (Ask Me Anything), answering questions from users for thirty minutes. At the peak of the event, over 198,000 users were attempting to view the President's AMA.

In just a couple of hours, the President gained over 17K points of comment karma. Based on stats from the RescueTime user base, he also had a pretty significant impact on Reddit’s server load (and other users’ page load times).

We saw that many users spent more time viewing Reddit’s heavy load outage page than they actually spent on the AMA page itself.

Reddit's Heavy Load Outage Page

We looked at aggregate user time on Reddit, broken down between viewing the outage page and successfully viewing the AMA page. Here's what that data looks like:

Graph of time spent on Outage page vs Obama's AMA page

The guys at Reddit posted a blog entry today with some other great statistics about the event. They added a total of 60 servers to try to keep up with demand, but still had problems because their load balancers couldn't handle the traffic.

Overall, this was a great opportunity for Reddit and the community at large. I would love to see ongoing AMA sessions with the President on a monthly basis (regardless of who that ends up being!). Perhaps as an addition to the "Weekly Address"?


How we use RescueTime, at RescueTime.

We built RescueTime because we thought it should be easier to make sure we’re spending our time the way we want to. It has opened up a whole new world of data for us, and we wanted to share some of the ways we make use of it around the office.

Forming a baseline lets us read the pulse of our team at a glance.

RescueTime lets us see how much time we’re spending on the computer, without having to keep time sheets or manual logs. By categorizing different applications and websites, we can get a pretty good sense of how much time we’re spending on productive stuff vs. unproductive stuff. That gets really cool when you have enough data for patterns to jump out. It also makes it really easy to see when something weird (not necessarily bad) happens.

Take the month of April, for example:

It's clear that something is very different about the first week, and it appears something odd happened on the last day of the month. It turns out two of us were out of town during the first week, and on the last day we were trying to hit some end-of-the-month deadlines.

Working 9-5? Not us.

UPDATE: Here’s a post from our CEO explaining the “5-hr productive rule” in much better detail.

We completely got rid of set working hours. After looking at a couple months of our data, we decided that 5 hours of productive time per day is a pretty good average. We set up an alert for RescueTime to let us know when we've reached that 5-hour mark, and use that instead of a set hourly schedule. This flexibility works out really well for us (especially considering we're a semi-distributed team). We still make sure there are a few hours in the day when we're all available at the same time, but aside from that it's up to each person to decide when they want to work.

Meetings at exactly the right time.

We use RescueTime's efficiency report and comparisons report to figure out what times we're most productive, and then never schedule a meeting on top of them. Since meetings can be a bit of a distraction anyway, we try to reserve them for the times of day when we're already a little scattered to begin with.

Unfortunately, it's not totally homogeneous; some people are more focused in the mornings and others in the afternoon. Having that extra context is still a huge help, though. (For example, I won't go near a meeting on Tuesday afternoons, which is when I'm most focused.)

It’s not just for team-wide decisions, either.

Those are a few ways RescueTime impacts our entire team. Individually, we use our RescueTime data in all sorts of ways. Here are a few highlights:

Robby (Product Development / Design):

"I use the time reports along with some metrics from other systems to figure out how long it takes me to do certain things. For instance, I can pretty easily tell that I spend just over 11 minutes on each customer support request I deal with (on average). I'd really like to bring that number down, and it's easier to do now that I have a visible baseline."

Joe (CEO / does a little bit of everything):

"I find it invaluable being able to know how long I've been working on a specific task. By being able to search for an individual document, like "linux_extended_info_grabber_sqlite.cpp", I can see that I've spent 4 hours 16 minutes so far this week coding that feature. In that same search, I can see how much time others in my group have spent on that same document. Being able to look back at this type of data is amazingly powerful for me. It helps me estimate times better, judge overall effort, and make better business decisions."

Mark (Chief Architect):

"I find value in surprisingly specific ways. For example, sometimes I will compare time spent in terminal applications versus code editors, to confirm or dispute the emotional feeling that I'm dragging and thrashing due to too much incremental testing (evinced by excessive terminal/shell time). And, probably unlike others, I'll sometimes notice my communications/email category taking up too little of my time, and force myself to re-engage some lagging communication efforts."

Jason (Sales / Marketing):

"I love having the Offline Time prompt. It motivates me to keep working, and it allows me to enter valuable time spent on the phone or Skype with customers. By categorizing all of my time, even unproductive time, it gives me a clear picture of how I'm performing that day versus my best day, and how many hours a day I average."

It’s probably worth noting that anything in this post could be applied in either a team setting or for a single individual.

How are you using RescueTime?


Google Doodle Strikes Again! 5.35 million hours strummed

Happy birthday to guitar legend Les Paul, who would have been 96 today!

Google launched its Les Paul Tribute Doodle on Thursday, which allowed users to record and play back friends' recordings, or just play along. Almost immediately, you could hear the rumblings on Twitter and Facebook about this great musician's effect on music and on many aspiring musicians; lots of people were doing their best Les Paul impersonations while Google entertained the masses. As one Twitter user said, "he's the REAL guitar hero!"

Here’s what we saw on Thursday:

Les Paul Doodle by Google

We immediately saw tweets coming in for the trending tag #lespaul at a rate of 20 tweets/sec or more while we were writing this article. With every major online publication commenting on and recommending the Les Paul Doodle, traffic was way up and people kept talking all day!

Les Paul Google Doodle

Along with the excited buzz surrounding the Les Paul Doodle, there were plenty of tweets about the loss in productivity it was causing. We saw hundreds of tweets like this one:

Tweet about Productivity

Last year we reported on the effect of Google's playable Pac-Man Doodle, so as a follow-up we ran the numbers to see if the Les Paul Doodle consumed a similar amount of users' time.

We looked at ~18,500 random RescueTime users who spent time on Google search pages. In previous time periods, users spent a very consistent 4.5 minutes (+/- 3 seconds) actively using Google search. However, when Google released the Les Paul Doodle, the average user spent 26 seconds MORE on Google.com than in previous time periods. By comparison, users spent 36 more seconds on Google during last year's Pac-Man Doodle, so you would think the Les Paul Doodle had less impact. Wrong. According to Wolfram Alpha and Alexa, Google's daily unique visitor count is up to 740 million, versus 505 million last year, so the smaller per-user bump was multiplied across a much larger audience.

  • Google's Les Paul Doodle consumed an additional 5,350,789 hours of time, versus the 4,819,352 hours consumed by the Pac-Man Doodle
  • $133,769,725 is the dollar tally, if the average Google user has a COST of $25/hr (note that cost is 1.3 – 2.0 X pay rate)
  • Users did not spend much more total time at their computers than in previous periods, but they did spend 10% more time on Google's website than they typically would, meaning that extra 10% was stolen from other computer use
We are already looking forward to the next interactive Doodle and wondering how it will stack up.

About the data:
RescueTime provides a time management tool to allow individuals and businesses to track their time and attention to see where their days go (and to help them get more productive!). We have hundreds of millions of man hours of second-by-second attention data from hundreds of thousands of users around the world, tracking both inside and outside the browser. The data for this report was compiled from roughly 18,500 randomly selected Google users.

About our software:
If you want to see how productive you are vs the rest of our users, you should check out our service. We offer both individual and group plans (pricing starts at FREE).


Does the World Cup matter?

Evidently there are plenty of hooligans in my neighborhood looking for an excuse to start drinking and yelling at a TV around noon in my favorite pub. This was a little surprising to me, since I live in a yuppie downtown Seattle neighborhood full of software geeks and otherwise respectable people.

Now that it’s all over with, I decided to see if there was a broader trend in RescueTime’s data. Time spent on the computer dropped about 4% and productive time dropped a full 10% here in the US on the day of our first game vs England.  More people than usual checked the news, which managed to grab a 5% bump despite the drop in total time.  Evidently no one was watching the game on their computers, since Entertainment (including sports) stayed flat.

The effect was even more pronounced in the UK. Productive time dropped 13%, total time dropped 7%, and instead of reading about the upset in the news like their American counterparts, the English were apparently watching it live, with a 5% bump in Entertainment.

All that’s interesting, but that game took place on a Saturday, when most people aren’t supposed to be working anyway.  When the US squeaked out a tie during the final minutes of their next match against Slovenia on Friday, our American users spent a little more time than normal on the news, but it wasn’t enough to cause a significant change in productivity.

Here is a graph of all the days of the World Cup, compared to a typical week*, to help see if there was a real trend here.

It's obvious that productive time was consistently down during the entire World Cup. The US's game dates are circled in red. It's interesting that after we were eliminated by Ghana, things picked up a bit, but still didn't quite make it back to normal. This might be because we have more international users than US-based ones. Total time spent on computers was down 4% and productive time was down 3% over all the working days in the tournament.

There are a couple of other interesting points in that graph, particularly the 18% drops in productivity over Father's Day and the Fourth of July weekend. People seemed to come back pretty slowly after the 4th, and didn't manage to get back into full swing until the end of the week.

When you look at it from RescueTime's perspective, it's pretty clear that the World Cup does matter.

*A typical week is the average from the 28 days before the World Cup began (Memorial Day was tossed out).

RescueTime provides a time management tool to allow individuals and businesses to track their time and attention to see where their days go (and to help them get more productive!). We have hundreds of millions of man hours of second-by-second attention data from hundreds of thousands of users around the world, tracking in real time both inside and outside the browser.


Google is eating Microsoft’s lunch, one tasty bite at a time.


Microsoft just launched Office 2010 to great fanfare, and quietly slipped in a new free online version. It looks like they may have finally realized that if they don't cannibalize their core business with a web-based offering, Google will. Has the sleeping giant over in Redmond finally awoken, and can it defend its biggest cash cow from the future?

Some analysts say Google’s online offering can’t compete with Microsoft’s.  They have no idea.

Google Apps vs Microsoft Office Daily Reach

We've been tracking the usage habits of hundreds of thousands of our users over the last two years, and you can clearly see that Google has managed to increase its daily reach from around 59% to 79%. Microsoft Office, on the other hand, has been steadily shedding users, losing about 9% of our population.

To get an idea of how relatively important each application in these suites is, here is a graph showing the full gamut.

Communication makes up about 18% of all computer usage. Google proved you could do email in the cloud not only competitively, but for free. Outlook and Gmail dominate these two companies' suites in terms of unique daily users. Gmail managed to increase its slice of the pie by about 3%, while Outlook lost about 6% of the total. That's a 21% relative decline for Microsoft versus 7% relative growth for Google in arguably the single most important software sector. Microsoft loses its integration advantage when people stop using big pieces of the suite, which may help explain the synergistic decline of Outlook and Excel. It's also interesting to note that Word and PowerPoint have been relegated to a tiny fraction of our users, who seem to greatly prefer Google Docs.

If that was the whole story, things might look pretty grim for Redmond, and it’s no wonder they’re being forced to respond to web based offerings.  However, there is at least one more way to consider the data, and that’s in terms of the amount of time spent in particular applications, not just the number of people using them.

It's clear again that email is the most important component of both companies' portfolios, but even though Gmail has about double the users, the smaller population of Outlook users spends more time emailing than all the Gmail users put together. Today, Outlook is the weapon of choice for heavy users, but if I were an exec at Microsoft, I'd be paying very close attention to the direction those blue and red lines move from here on out. You might also notice that in terms of spreadsheet usage, there is really only one option.

About the data:
RescueTime provides a time management tool to allow individuals and businesses to track their time and attention to see where their days go (and to help them get more productive!). We have hundreds of millions of man hours of second-by-second attention data from hundreds of thousands of users around the world, tracking in real time both inside and outside the browser.  We selected annual date boundaries for this set, to help reveal seasonal variations in usage, like the holiday dip in productivity.

About our software:
If you want to see how productive you are vs the rest of our users, you should check out our product tour. We offer both individual and group plans (pricing starts at FREE).



The Tragic Cost of Google Pac-Man – 4.82 million hours

When Google launched its Pac-Man logo on Friday, we immediately heard amused groans in our tweet-streams. “Well, so much for my morning,” said one. “Google’s Pac Man logo just ruined millions of dollars in productivity today, nationwide,” said another.

Here’s what we all saw on Friday:

Here are two of the tweets we saw in response:

Given our repository of hundreds of millions of man hours of second-by-second attention data, we figured there's no one better than RescueTime to tell the world about the cost of Google Pac-Man on that fateful Friday. Here's what we learned.

The first thing to understand is that Google does not account for a lot of active usage, in terms of time. Yes, we all use Google. But a Google search only requires a few seconds, and we're all pretty well trained to click one of the first few links. Add to that the fact that many people use Google as a navigation tool (Googling "IBM" instead of typing in www.ibm.com). Nonetheless, it might surprise you that our average Google user spends only 4 and a half active minutes on Google search per day, spread over about 22 page views. That's roughly 11 seconds of attention invested in each Google page view. It doesn't sound like a lot, but next time you do a search, count to 11; it's a long time.

This weekend, we took a hard look at Pac-Man D-Day and compared it with previous Fridays (before and after Google's recent redesign), and found some noticeable differences. We took a random subset of our users (about 11,000 people, spending about 3 million seconds on Google that day). The average user spent 36 seconds MORE on Google.com on Friday. Thankfully, Google rolled out the logo with pretty low "perceived affordance": they put an "insert coin" button next to the search button, but I imagine most users missed it. In fact, I'd wager that 75% of the people who saw the logo had no idea you could actually play it. Which the world should be thankful for.

If we take Wolfram Alpha at its word, Google had about 504,703,000 unique visitors on May 23. If we assume that our userbase is representative, that means:

  • Google Pac-Man consumed 4,819,352 hours of time (beyond the 33.6m daily man hours of attention that Google Search gets in a given day)
  • $120,483,800 is the dollar tally, if the average Google user has a COST of $25/hr (note that cost is 1.3 – 2.0 X pay rate).
  • For that same cost, you could hire all 19,835 Google employees, from Larry and Sergey down to their janitors, and get 6 weeks of their time. Imagine what you could build with that army of manpower.
  • $298,803,988 is the dollar tally if all of the Pac-Man players had an approximate cost of the average Google employee.

I hope you've enjoyed our Pac-Man data journey as much as we have. Next up on our data-hacking list, we'll be digging in to find the laziest and most productive countries and cities in the world. Where do you think yours ranks?

About the data:
RescueTime provides a time management tool to allow individuals and businesses to track their time and attention to see where their days go (and to help them get more productive!). We have hundreds of millions of man hours of second-by-second attention data from hundreds of thousands of users around the world, tracking both inside and outside the browser. The data for this report was compiled from 11,000 randomly selected Google users.

About our software:
If you want to see how productive you are vs the rest of our users, you should check out our service. We offer both individual and group plans (pricing starts at FREE).