DBD::SQLite::Fulltext_search (3)
Leading comments
Automatically generated by Pod::Man 4.09 (Pod::Simple 3.35) Standard preamble: ========================================================================
NAME
DBD::SQLite::Fulltext_search - Using fulltext searches with DBD::SQLiteDESCRIPTION
Introduction
SQLite is bundled with an extension module called ``The first full-text search modules for SQLite were called "FTS1" and "FTS2" and are now obsolete. The latest version is "FTS4", but it shares many features with the former module "FTS3", which is why parts of the
Short example
Here is a very short example of using
$dbh->do(<<"") or die DBI::errstr; CREATE VIRTUAL TABLE fts_example USING fts4(content) my $sth = $dbh->prepare("INSERT INTO fts_example(content) VALUES (?)"); $sth->execute($_) foreach @docs_to_insert; my $results = $dbh->selectall_arrayref(<<""); SELECT docid, snippet(fts_example) FROM fts_example WHERE content MATCH 'foo'
The key points in this example are :
- *
-
The syntax for creating FTStables is
CREATE VIRTUAL TABLE <table_name> USING fts4(<columns>)
where "<columns>" is a list of column names. Columns may be typed, but the type information is ignored. If no columns are specified, the default is a single column named "content". In addition,
FTStables have an implicit column called "docid" (or also "rowid") for numbering the stored documents. - *
- Statements for inserting, updating or deleting records use the same syntax as for regular SQLite tables.
- *
- Full-text searches are specified with the "MATCH" operator, and an operand which may be a single word, a word prefix ending with '*', a list of words, a ``phrase query'' in double quotes, or a boolean combination of the above.
- *
- The builtin function "snippet(...)" builds a formatted excerpt of the document text, where the words pertaining to the query are highlighted.
There are many more details to building and searching
QUERY SYNTAX
Here are some explanation aboutToken or token prefix queries
An
-- Virtual table declaration CREATE VIRTUAL TABLE docs USING fts3(title, body); -- Query for all documents containing the term "linux": SELECT * FROM docs WHERE docs MATCH 'linux'; -- Query for all documents containing a term with the prefix "lin". SELECT * FROM docs WHERE docs MATCH 'lin*';
If a search token (on the right-hand side of the
Column specifications
Normally, a token or token prefix query is matched against the
-- Query the database for documents for which the term "linux" appears in -- the document title, and the term "problems" appears in either the title -- or body of the document. SELECT * FROM docs WHERE docs MATCH 'title:linux problems'; -- Query the database for documents for which the term "linux" appears in -- the document title, and the term "driver" appears in the body of the document -- ("driver" may also appear in the title, but this alone will not satisfy the. -- query criteria). SELECT * FROM docs WHERE body MATCH 'title:linux driver';
Phrase queries
A phrase query is a query that retrieves all documents that contain a nominated set of terms or term prefixes in a specified order with no intervening tokens. Phrase queries are specified by enclosing a space separated sequence of terms or term prefixes in double quotes ("). For example:
-- Query for all documents that contain the phrase "linux applications". SELECT * FROM docs WHERE docs MATCH '"linux applications"'; -- Query for all documents that contain a phrase that matches "lin* app*". -- As well as "linux applications", this will match common phrases such -- as "linoleum appliances" or "link apprentice". SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
NEAR queries.
A
-- Virtual table declaration. CREATE VIRTUAL TABLE docs USING fts4(); -- Virtual table data. INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system'); -- Search for a document that contains the terms "sqlite" and "database" with -- not more than 10 intervening terms. This matches the only document in -- table docs (since there are only six terms between "SQLite" and "database" -- in the document). SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database'; -- Search for a document that contains the terms "sqlite" and "database" with -- not more than 6 intervening terms. This also matches the only document in -- table docs. Note that the order in which the terms appear in the document -- does not have to be the same as the order in which they appear in the query. SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite'; -- Search for a document that contains the terms "sqlite" and "database" with -- not more than 5 intervening terms. This query matches no documents. SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite'; -- Search for a document that contains the phrase "ACID compliant" and the term -- "database" with not more than 2 terms separating the two. This matches the -- document stored in table docs. SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"'; -- Search for a document that contains the phrase "ACID compliant" and the term -- "sqlite" with not more than 2 terms separating the two. This also matches -- the only document stored in table docs. SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
More than one
-- The following query selects documents that contains an instance of the term -- "sqlite" separated by two or fewer terms from an instance of the term "acid", -- which is in turn separated by two or fewer terms from an instance of the term -- "relational". SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational'; -- This query matches no documents. There is an instance of the term "sqlite" with -- sufficient proximity to an instance of "acid" but it is not sufficiently close -- to an instance of the term "relational". SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
Phrase and
Set operations
The three basic query types described above may be used to query the full-text index for the set of documents that match the specified criteria. Using the- *
-
The ANDoperator determines the intersection of two sets of documents.
- *
-
The ORoperator calculates the union of two sets of documents.
- *
-
The NOToperator may be used to compute the relative complement of one set of documents with respect to another.
The
The
Boolean operations as just described correspond to the so-called ``enhanced query syntax'' of sqlite; this is the version compiled with "DBD::SQLite", starting from version 1.31. A former version, called the ``standard query syntax'', used to support tokens prefixed with '+' or '-' signs (for token inclusion or exclusion); if your application needs to support this old syntax, use DBD::SQLite::FTS3Transitional (published in a separate distribution) for doing the conversion.
TOKENIZERS
Concept
The behaviour of full-text indexes strongly depends on how documents are split into tokens; therefore
CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)
where "<tokenizer>" is a sequence of space-separated words that triggers a specific tokenizer. Tokenizers can be SQLite builtins, written in C code, or Perl tokenizers. Both are as explained below.
SQLite builtin tokenizers
SQLite comes with some builtin tokenizers (see <www.sqlite.org/fts3.html#tokenizer>) :- simple
-
Under the simple tokenizer, a term is a contiguous sequence of
eligible characters, where eligible characters are all alphanumeric
characters, the ``_'' character, and all characters with UTFcodepoints greater than or equal to 128. All other characters are discarded when splitting a document into terms. They serve only to separate adjacent terms.
All uppercase characters within the
ASCIIrange (UTFcodepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer. - porter
- The porter tokenizer uses the same rules to separate the input document into terms, but as well as folding all terms to lower case it uses the Porter Stemming algorithm to reduce related English language words to a common root.
- icu
-
The icu tokenizer uses the ICUlibrary to decide how to identify word characters in different languages; however, this requires SQLite to be compiled with the "SQLITE_ENABLE_ICU" pre-processor symbol defined. So, to use this tokenizer, you need edit Makefile.PL to add this flag in @CC_DEFINE, and then recompile "DBD::SQLite"; of course, the prerequisite is to have anICUlibrary available on your system.
- unicode61
-
The unicode61 tokenizer works very much like ``simple'' except that it
does full unicode case folding according to rules in Unicode Version
6.1 and it recognizes unicode space and punctuation characters and
uses those to separate tokens. By contrast, the simple tokenizer only
does case folding of ASCIIcharacters and only recognizesASCIIspace and punctuation characters as token separators.
By default, ``unicode61'' also removes all diacritics from Latin script characters. This behaviour can be overridden by adding the tokenizer argument "remove_diacritics=0". For example:
-- Create tables that remove diacritics from Latin script characters -- as part of tokenization. CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61); CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1"); -- Create a table that does not remove diacritics from Latin script -- characters as part of tokenization. CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
Additional options can customize the set of codepoints that unicode61 treats as separator characters or as token characters --- see the documentation in <www.sqlite.org/fts3.html#unicode61>.
If a more complex tokenizing algorithm is required, for example to implement stemming, discard punctuation, or to recognize compound words, use the perl tokenizer to implement your own logic, as explained below.
Perl tokenizers
Declaring a perl tokenizerIn addition to the builtin SQLite tokenizers, "DBD::SQLite" implements a perl tokenizer, that can hook to any tokenizing algorithm written in Perl. This is specified as follows :
CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')
where "<perl_function>" is a fully qualified Perl function name (i.e. prefixed by the name of the package in which that function is declared). So for example if the function is "my_func" in the main program, write
CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')
Writing a perl tokenizer by hand
That function should return a code reference that takes a string as single argument, and returns an iterator (another function), which returns a tuple "($term, $len, $start, $end, $index)" for each term. Here is a simple example that tokenizes on words according to the current perl locale
sub locale_tokenizer { return sub { my $string = shift; use locale; my $regex = qr/\w+/; my $term_index = 0; return sub { # closure $string =~ /$regex/g or return; # either match, or no more token my ($start, $end) = ($-[0], $+[0]); my $len = $end-$start; my $term = substr($string, $start, $len); return ($term, $len, $start, $end, $term_index++); } }; }
There must be three levels of subs, in a kind of ``Russian dolls'' structure, because :
- *
-
the external, named sub is called whenever accessing a FTStable with that tokenizer
- *
- the inner, anonymous sub is called whenever a new string needs to be tokenized (either for inserting new text into the table, or for analyzing a query).
- *
- the innermost, anonymous sub is called repeatedly for retrieving all terms within that string.
Using Search::Tokenizer
Instead of writing tokenizers by hand, you can grab one of those already implemented in the Search::Tokenizer module. For example, if you want ignore differences between accented characters, you can write :
use Search::Tokenizer; $dbh->do(<<"") or die DBI::errstr; CREATE ... USING fts4(<columns>, tokenize=perl 'Search::Tokenizer::unaccent')
Alternatively, you can use ``new'' in Search::Tokenizer to build your own tokenizer. Here is an example that treats compound words (words with an internal dash or dot) as single tokens :
sub my_tokenizer { return Search::Tokenizer->new( regex => qr{\p{Word}+(?:[-./]\p{Word}+)*}, ); }
Fts4aux - Direct Access to the Full-Text Index
The content of a full-text index can be accessed through the virtual table module ``fts4aux''. For example, assuming that our database contains a full-text indexed table named ``ft'', we can declare :
CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)
and then query the "ft_terms" table to access the list of terms, their frequency, etc. Examples are documented in <www.sqlite.org/fts3.html#fts4aux>.
How to spare database space
By default,In particular, the option for contentless
CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)
Data can be inserted into such an
-- This statement is Ok: INSERT INTO t1(docid, a, b) VALUES(1, 'a b c', 'd e f'); -- This statement causes an error, as no docid value has been provided: INSERT INTO t1(a, b) VALUES('j k l', 'm n o');
Of course your application will need an algorithm for finding the external resource corresponding to any docid stored within SQLite.
When using placeholders, the docid must be explicitly typed to
my $sth = $dbh->prepare("INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)"); $sth->execute(2, 'aa', 'bb'); # constraint error
but it works with an explicitly cast :
my $sql = "INSERT INTO t1(docid, a, b) VALUES(CAST(? AS INTEGER), ?, ?)", my $sth = $dbh->prepare(sql); $sth->execute(2, 'aa', 'bb');
or with an explicitly typed ``bind_param'' in
use DBI qw/SQL_INTEGER/; my $sql = "INSERT INTO t1(docid, a, b) VALUES(?, ?, ?)"; my $sth = $dbh->prepare(sql); $sth->bind_param(1, 2, SQL_INTEGER); $sth->bind_param(2, "aa"); $sth->bind_param(3, "bb"); $sth->execute();
It is not possible to
Contentless
AUTHOR
Laurent Dami <dami@cpan.org>COPYRIGHT
Copyright 2014 Laurent Dami.Some parts borrowed from the <sqlite.org> documentation, copyright 2014.
This documentation is in the public domain; you can redistribute it and/or modify it under the same terms as Perl itself.