At Gap Intelligence my team deals with a lot of data. To be specific, we have market intelligence data with a historical estimate of over 200 million rows. With all of this data, I realized we were quickly running into bottlenecks with our API, especially since it is one of the main ways our clients consume our data, as well as what powers Gap's internal and external applications and offerings.
For some background, our application is written in Ruby on Rails. For the first iteration of our API, we used Rails's ORM, Active Record. It comes with every Rails application and is amazing for fast development; however, that convenience comes at the price of performance. In this blog I will show you how we leveraged Elasticsearch to speed up querying our data, and how you might do the same with yours.
pricings_controller.rb
def index
  @pricings = Pricing
  filtering_params(params).each do |key, value|
    @pricings = @pricings.public_send("by_#{key}", value) if value.present?
  end
  @pricings = @pricings.includes(:merchant).order(order).page(page).per(per_page)
  render json: @pricings, serializer: ::V1::PaginationSerializer
end

private

def filtering_params(params)
  params.slice(:category_name, :part_number)
end
This is a pretty straightforward approach. We loop through the allowed filtering parameters passed into the index action. For this to work, we must have model methods defined, such as by_category_name and by_part_number. In our case, we made model scopes, which can be chained together if multiple parameters get passed in. Below you can see these scopes defined in our pricing model.
We also use pagination to limit the amount of data the API returns in the JSON response. Lastly, the index action renders the JSON using a serializer. We use the Active Model Serializers (AMS) gem. One of the reasons we chose AMS is to take advantage of its JSON API adapter, which lets us easily format our JSON according to the JSON API specification (jsonapi.org/format). In the controller code, we go through a PaginationSerializer, which is shared by all serializers and dynamically figures out which model serializer to use based on the model. In this case, that is the PricingSerializer you see below.
pricing.rb
class Pricing < ActiveRecord::Base
  scope :by_category_name, ->(categories) { joins(:category).where("LOWER(categories.name) IN (?)", categories.downcase.split(',')) }
  scope :by_part_number, ->(part_numbers) { joins(:product).where("LOWER(products.part_number) IN (?)", part_numbers.downcase.split(',')) }
  ...
end
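The chaining pattern itself doesn't depend on Rails. Here is a plain-Ruby sketch of it, with a hypothetical FakeRelation class standing in for an Active Record relation: each by_* method returns a new relation carrying the accumulated filter conditions, just as chained scopes do.

```ruby
# A plain-Ruby sketch of the scope-chaining pattern. FakeRelation is a
# hypothetical stand-in for an Active Record relation: each by_* method
# returns a *new* relation with the accumulated filter conditions.
class FakeRelation
  attr_reader :conditions

  def initialize(conditions = [])
    @conditions = conditions
  end

  def by_category_name(categories)
    FakeRelation.new(conditions + [[:category_name, categories.downcase.split(',')]])
  end

  def by_part_number(part_numbers)
    FakeRelation.new(conditions + [[:part_number, part_numbers.downcase.split(',')]])
  end
end

# Mirrors the controller loop: only parameters that are present get applied.
relation = FakeRelation.new
{ category_name: "TVs,Monitors", part_number: "123", other: nil }.each do |key, value|
  next unless value && relation.respond_to?("by_#{key}")
  relation = relation.public_send("by_#{key}", value)
end

relation.conditions
# => [[:category_name, ["tvs", "monitors"]], [:part_number, ["123"]]]
```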
pricing_serializer.rb
class V1::PricingSerializer < V1::BaseSerializer
  attributes :id, :date_collected, :shelf_price, :in_stock, :merchant, :product

  def merchant
    object.merchant.name
  end

  def product
    object.product.name
  end
end
So our API works and returns the data we need. Awesome! However, our pricing model has millions of records, which means querying the database can be quite slow, especially when paginating deep into the result set; that is where you see some really slow response times. Performance is critical to any API: you want it to return data quickly. With that in mind, we decided to give Elasticsearch a shot.
We decided to use a gem called Searchkick, which sits on top of Elasticsearch and makes it simple to search your indexed data in a Rails-like fashion.
First things first, we need to decide what data to store in the search index. In Searchkick, you implement the search_data method in your model to define this. For example, in the pricing model:
pricing.rb
def search_data
  {
    category_name: category.name.downcase,
    part_number: part_number
  }
end
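To make the indexed document shape concrete, here is a small sketch using hypothetical Struct stand-ins for a Pricing row and its category (the values are made up). Note that downcasing at index time matters, because the terms filters we build later compare against lowercased input.

```ruby
# Hypothetical stand-ins for a Pricing record and its associated Category,
# just to show the document that search_data produces for the index.
Category = Struct.new(:name)
PricingRow = Struct.new(:category, :part_number)

pricing = PricingRow.new(Category.new("TVs"), "UN55-123")

# Mirrors the search_data method in the pricing model above.
search_doc = {
  category_name: pricing.category.name.downcase,
  part_number: pricing.part_number
}
# => { category_name: "tvs", part_number: "UN55-123" }
```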
After you have defined what data is indexed, you simply call Pricing.reindex to index your data. Keep in mind every time you change the search_data method you must reindex the data. One way to speed up the indexing is to eager load your associations. In Searchkick you can define the search_import scope like this:
pricing.rb
scope :search_import, -> { includes(:merchant, :category, :product) }
OK, we have our data indexed. Now how do we query Elasticsearch to get the data we need? Searchkick comes with some nice, easy ways of searching the data, but as your searches become more advanced, it's recommended to use the Elasticsearch query DSL, which Searchkick fully supports. So with that, let's get into the controller logic.
pricings_controller.rb
def index
  @pricings =
    Pricing.search(
      include: [:category, :merchant, :product],
      query: query,
      order: order,
      page: page,
      per_page: per_page
    )
  render json: @pricings, serializer: ::V1::PaginationSerializer
end
private

def build_query
  query_array = []
  filtering_params.each do |key, value|
    if value.present?
      query_array << Pricing.public_send("search_by_#{key}", value)
    end
  end
  query_array
end

def filtering_params
  params.slice(:part_number, :category_name)
end

def query
  { bool: { must: build_query } }
end
As you can see, there is some familiar code here: we still have the filtering_params method, but we call slightly different methods as we loop through the parameters. These methods are defined in a model concern and included in the model. For example:
pricing.rb
class Pricing < ActiveRecord::Base
  include PricingSearchable
end
Moving the Elasticsearch logic into a concern:
pricing_searchable.rb
module PricingSearchable
  extend ActiveSupport::Concern
  include Searchable

  included do
    after_save :reindex

    scope :search_import, -> { includes(:merchant, :category, :product) }

    def search_data
      {
        category_name: category.name.downcase,
        part_number: part_number
      }
    end
  end

  module ClassMethods
    def search_by_category_name(category_names)
      { terms: { category_name: category_names.downcase.split(',') } }
    end

    def search_by_part_number(part_numbers)
      { terms: { part_number: part_numbers.downcase.split(',') } }
    end
  end
end
I like that all the ES logic is in a concern now, which keeps the pricing model cleaner.
As for building the Elasticsearch query, we are using what is called a bool query: a query that matches documents matching boolean combinations of other queries. Each boolean clause gets a typed occurrence, in this case 'must', meaning all of those query clauses must appear in matching documents. So, for example, if I search by a category of 'TVs' and a part number of '123', only documents that match both conditions are returned in the query results. For more information on the query DSL, visit the Elasticsearch documentation.
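To make the final query shape concrete, here is a plain-Ruby sketch (the filter values are made up) of what the query method ends up producing when both filters are present. The helper methods simply mirror search_by_category_name and search_by_part_number from the concern.

```ruby
# Plain-Ruby sketch of the bool query the controller builds. The terms
# clauses mirror search_by_category_name / search_by_part_number above.
def search_by_category_name(category_names)
  { terms: { category_name: category_names.downcase.split(',') } }
end

def search_by_part_number(part_numbers)
  { terms: { part_number: part_numbers.downcase.split(',') } }
end

# With category "TVs" and part number "123" both present, build_query
# collects one terms clause per filter, and query wraps them in a bool/must.
must_clauses = [
  search_by_category_name("TVs"),
  search_by_part_number("123")
]

query = { bool: { must: must_clauses } }
# => { bool: { must: [
#        { terms: { category_name: ["tvs"] } },
#        { terms: { part_number: ["123"] } }
#      ] } }
```

Because both clauses sit under must, a document has to match every terms clause to be returned.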
The rest of the parameters we pass to the search method are pretty straightforward. The include parameter eager loads related models. The order parameter sorts the result set; keep in mind that any data you want to sort by needs to be added to the Elasticsearch index. And lastly, we have the pagination parameters page and per_page.
For our performance testing, we wanted to see how well the API performed when paginating deep into a large result set. I wrote a script to hit a specific page in the result set and averaged the response time over three requests. Below is a table showing how many milliseconds each request took, and it is obvious how much the API improved when leveraging Elasticsearch. Seeing these performance gains was a huge win for us.
| Pagination Page | API w/ActiveRecord | API w/Elasticsearch |
|---|---|---|
| 100 | 895ms | 110ms |
| 1,000 | 1425ms | 221ms |
| 10,000 | 1899ms | 301ms |
| 100,000 | 5106ms | 2601ms |
| 200,000 | 6620ms | 3683ms |
| 300,000 | 8282ms | 3293ms |
| 400,000 | 10202ms | 3525ms |
| 500,000 | 12515ms | 3923ms |