Google App Engine Datastore Composite Index Zigzag Merge Join

August 20, 2018

When a entity model class is defined, basic property class such as StringProperty, IntergerProperty, DateTimeProperty, etc are indexed by default (you can disable it by using indexed=False). TextProperty is not indexed.

class Read(Model):
    posted_date = ndb.DateTimeProperty()
    name = ndb.StringProperty(indexed=False)
    description = ndb.TextProperty()
    parent_key = ndb.KeyProperty()
    is_active = ndb.BooelanProperty(default=True)

Only indexed property can be queried.

items = Read.query(Read.is_active == True).fetch()

NOTE: if you declare a new property after created N items, the new property query doesn’t work for old items until you save/put the old items.

When you need to perform sorting on the queried result, you would need to declare composite index.

Edit index.yaml.

- kind: Read
  properties:
  - name: is_active
  - name: posted_date
    direction: desc
items = Read.query(Read.is_active == True).order(-Read.posted_date).fetch()

NOTE: To make sure all the necessary composite index are declared during development (no auto generation), use dev_appserver.py –require_indexes.

When you need query Read.is_active and Read.parent_key, you have 2 options.

Option 1: Create a full index. The benefit of this index is best performance.

- kind: Read
  properties:
  - name: is_active
  - name: parent_key
  - name: posted_date
    direction: desc

It only works for the following query (both query property must be specified).

items = Read.query(Read.is_active == True, Read.parent_key == key).order(-Read.posted_date).fetch()

Options 2: Create 2 partial index. The performance is slower as it might involve multiple query using zigzag merge join algorithm.

- kind: Read
  properties:
  - name: is_active
  - name: posted_date
    direction: desc

- kind: Read
  properties:
  - name: parent_key
  - name: posted_date
    direction: desc    

An extra benefit of option 2 is that all the following query are supported.

# query by is_active
items = Read.query(Read.is_active == True).order(-Read.posted_date).fetch()

# query by parent_ke
items = Read.query(Read.parent_key == key).order(-Read.posted_date).fetch()

# query by is_active and parent_key
items = Read.query(Read.is_active == True, Read.parent_key == key).order(-Read.posted_date).fetch()

For best performance and flexibility, you can declare all 3 composite index.

- kind: Read
  properties:
  - name: is_active
  - name: posted_date
    direction: desc

- kind: Read
  properties:
  - name: parent_key
  - name: posted_date
    direction: desc 

- kind: Read
  properties:
  - name: is_active
  - name: parent_key
  - name: posted_date
    direction: desc

There are also other considerations (besides performance and flexibility) such as index limit and size. Read the references section for a better understanding.

NOTE: It is advisable not to declare composite index which you don’t intent to use.

References:

This work is licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License.