formats module

The classes in this module encode and decode posting information for a field. The field format determines what information is stored about each occurrence of a term.

Base class

class whoosh.formats.Format(field_boost=1.0, **options)

Abstract base class representing a storage format for a field or vector. Format objects are responsible for writing and reading the low-level representation of a field. They control what kind and level of information is stored about the indexed fields.

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.

decode_as(astype, valuestring)

Interprets the encoded value string as ‘astype’, where ‘astype’ is for example “frequency” or “positions”. This object must have a corresponding decode_<astype>() method.

decoder(name)

Returns the bound method for interpreting a posting value as ‘name’, where ‘name’ is for example “frequency” or “positions”. This object must have a corresponding decode_<name>() method.

supports(name)

Returns True if this format supports interpreting its posting value as ‘name’ (e.g. “frequency” or “positions”).
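
The three lookup methods above all depend on the decode_<name> naming convention. A minimal sketch of that dispatch pattern follows. ToyFormat and its plain-text integer encoding are hypothetical stand-ins chosen for illustration; this is not Whoosh's implementation or its on-disk encoding.

```python
class ToyFormat:
    """Hypothetical stand-in for a Format subclass (not Whoosh's code)."""

    def decode_frequency(self, valuestring):
        # Assumption: this toy format stores the frequency as decimal text.
        return int(valuestring)

    def supports(self, name):
        # A name is supported if a matching decode_<name> method exists.
        return hasattr(self, "decode_" + name)

    def decoder(self, name):
        # Return the bound decode_<name> method for the given name.
        return getattr(self, "decode_" + name)

    def decode_as(self, astype, valuestring):
        # Look up the decoder for 'astype' and apply it to the encoded value.
        return self.decoder(astype)(valuestring)


fmt = ToyFormat()
print(fmt.supports("frequency"))         # True
print(fmt.supports("positions"))         # False
print(fmt.decode_as("frequency", "3"))   # 3
```

In this sketch, asking decoder() for an unsupported name raises AttributeError, so callers would check supports() first.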

word_values(value, analyzer, **kwargs)

Takes the text value to be indexed and yields a series of (“tokentext”, frequency, weight, valuestring) tuples, where frequency is the number of times “tokentext” appeared in the value, weight is the weight (a float usually equal to frequency in the absence of per-term boosts), and valuestring is the encoded, field-specific posting value for the token. For example, in a Frequency format, the value string would be the same as frequency; in a Positions format, the value string would encode a list of token positions at which “tokentext” occurred.

Parameters:
  • value – The unicode text to index.
  • analyzer – The analyzer to use to process the text.
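
The shape of the tuples described above can be sketched as follows. Everything here is an assumption made for illustration: toy_word_values, simple_analyzer, and the mode argument are hypothetical names, and the value strings are plain text rather than the encoding a real format would produce.

```python
from collections import defaultdict


def simple_analyzer(text):
    # Hypothetical analyzer: lowercase the text and split on whitespace.
    return text.lower().split()


def toy_word_values(value, analyzer, mode="frequency"):
    """Yield (tokentext, frequency, weight, valuestring) tuples.

    Illustrative only: the value strings are readable text, not a real
    posting encoding.
    """
    # Collect the positions at which each token occurs.
    positions = defaultdict(list)
    for pos, token in enumerate(analyzer(value)):
        positions[token].append(pos)

    for token, poslist in sorted(positions.items()):
        freq = len(poslist)
        weight = float(freq)  # no per-term boosts, so weight equals frequency
        if mode == "frequency":
            valuestring = str(freq)  # value string mirrors the frequency
        else:
            # "positions" mode: encode the list of token positions
            valuestring = ",".join(str(p) for p in poslist)
        yield (token, freq, weight, valuestring)


for row in toy_word_values("to be or not to be", simple_analyzer, mode="positions"):
    print(row)
# ('be', 2, 2.0, '1,5')
# ('not', 1, 1.0, '3')
# ('or', 1, 1.0, '2')
# ('to', 2, 2.0, '0,4')
```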

Formats

class whoosh.formats.Existence(field_boost=1.0, **options)

Only indexes whether a given term occurred in a given document; it does not store frequencies or positions. This is useful for fields that should be searchable but not scorable, such as a file path.

Supports: frequency, weight (always reports frequency = 1).

class whoosh.formats.Frequency(field_boost=1.0, boost_as_freq=False, **options)

Stores frequency information for each posting.

Supports: frequency, weight.

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.

class whoosh.formats.Positions(field_boost=1.0, **options)

Stores position information in each posting, to allow phrase searching and “near” queries.

Supports: frequency, weight, positions, position_boosts (always reports position boost = 1.0).

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.

class whoosh.formats.Characters(field_boost=1.0, **options)

Stores token position and character start and end information for each posting.

Supports: frequency, weight, positions, position_boosts (always reports position boost = 1.0), characters.

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.

class whoosh.formats.PositionBoosts(field_boost=1.0, **options)

A format that stores positions and per-position boost information in each posting.

Supports: frequency, weight, positions, position_boosts.

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.

class whoosh.formats.CharacterBoosts(field_boost=1.0, **options)

A format that stores positions, character start and end, and per-position boost information in each posting.

Supports: frequency, weight, positions, position_boosts, characters, character_boosts.

Parameters: field_boost – A constant boost factor by which to scale the score of all queries matching terms in this field.
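
To compare the formats at a glance, the “Supports:” lines above can be collected into a lookup table. The sketch below is plain Python that does not import Whoosh; SUPPORTS and formats_supporting are hypothetical names introduced for illustration, and the table only restates what this page lists.

```python
# Capabilities copied from the "Supports:" lines in this section.
SUPPORTS = {
    "Existence": {"frequency", "weight"},
    "Frequency": {"frequency", "weight"},
    "Positions": {"frequency", "weight", "positions", "position_boosts"},
    "Characters": {"frequency", "weight", "positions", "position_boosts",
                   "characters"},
    "PositionBoosts": {"frequency", "weight", "positions", "position_boosts"},
    "CharacterBoosts": {"frequency", "weight", "positions", "position_boosts",
                        "characters", "character_boosts"},
}


def formats_supporting(*names):
    """Return the format names whose capability set includes every name."""
    needed = set(names)
    return [fmt for fmt, caps in SUPPORTS.items() if needed <= caps]


print(formats_supporting("characters"))
# ['Characters', 'CharacterBoosts']
print(formats_supporting("positions", "position_boosts"))
# ['Positions', 'Characters', 'PositionBoosts', 'CharacterBoosts']
```

The table records only which interpretations can be decoded, not whether the stored values vary: as noted above, Existence always reports a frequency of 1, and Positions and Characters always report a position boost of 1.0.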