Gadfly Frequently Asked Questions
=================================

:Version: $Revision: 1.1.1.1 $

.. contents::

General information
-------------------

The following topics are of a general nature.

What is gadfly?
~~~~~~~~~~~~~~~~~
Gadfly  is  a  relational  database management system which uses a
large subset of very standard SQL as its query language and Python
modules  and optional Python/C extension modules as its underlying
engine. Gadfly stores the active database in memory, with recovery
logging  to  a  file  system. It supports an optional TCP/IP based
client  server  mode  and log based failure recovery for system or
software failures (but not for disk failures).

What is the current version?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The current version is Gadfly 1.0. To be sure you have the current
version,  please  consult  the  version  number  that comes in the
documentation.

Gadfly 1.0 attempts to fix all known bugs in previous versions and
adds a small number of features (hopefully without adding too many
new bugs).

Where can I get it?
~~~~~~~~~~~~~~~~~~~~~
Please get it from http://www.chordate.com

Why SQL and the relational model?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  is based in SQL and the relational model primarily because
the  SQL  query language and the relational model have been highly
successful,  are  highly  standard,  and  are well understood by a
large  number  of  programmers. If you understand standard SQL you
already  know  how  to use Gadfly -- and this distinguishes Gadfly
from  many  similar  database systems out there -- even ones which
aren't "really free."

Why Python?
~~~~~~~~~~~~~
Gadfly  is  based in Python because Python made the development of
such  a  relatively  sophisticated and standard SQL implementation
feasible  for  one programmer, in his spare time, and also because
it  automatically  made  Gadfly  portable  to almost any computing
platform.  In  addition, Gadfly can be used with Python to develop
sophisticated data manipulation applications easily.

What is the License? Is it Free? Why?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  can  be  used and modified in source form for any purpose,
including  commercial  purposes.  There  is no explicit or implied
warrantee  for  Gadfly,  even  though it is intended to be useful.
Gadfly  is free because I wanted people to use it, and I suspected
they   wouldn't  if  it  wasn't  free  and  highly  standard,  and
furthermore  I  didn't  want  to  risk  the  possibility  of legal
responsibility if people did use it and had problems.

You mean I can bundle it into my own product for free? Really?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes. Please do. Several organizations have done so already.

Where does it run?
~~~~~~~~~~~~~~~~~~~~
To run Gadfly you need (1) a file system and (2) Python. If Python
runs on your platform Gadfly should run there too. I have not been
able  to verify that this version runs on the Macintosh yet (since
the  Mac  apparently  has  some  fairly  nonstandard  file  system
conventions),  but  even  if  it doesn't it should run after a few
trivial  adjustments.  Please let me know if Gadfly doesn't run on
any of your platforms.

Gadfly  is  expected  to  run  "out  of the box" on Windows 95/98,
Windows  NT,  all Unices or unix like platforms (Solaris, IX, BSD,
Irix,  Linux)  and  anywhere else where Python runs and there is a
filesystem (Windows CE, QNX, Mac, etcetera).

Furthermore  the  Gadfly  database  storage  format  (based in the
Python  marshal  format)  is  binary  compatible between all these
systems -- you can create a database on any one of the systems and
use  a  binary  transfer  mechanism  to  any  other system and the
database should load and run.

What are the software/hardware requirements?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  requires  a  Python installation and sufficient memory and
disk  on  the  target  machine  to  store  the  live  databases in
question.  Since  live  Gadfly databases are kept in memory during
active  use  Gadfly is not appropriate for databases that approach
the  size of virtual memory. In practice this means that on modern
inexpensive  machines  Gadfly  can  store and manipulate databases
with  tens  of  thousands  of rows without difficulty. Unless your
machine  is  large  databases  with  millions  of  rows  might  be
infeasible  --  and  even  if memory is sufficient the loading and
storing  of  the database might make startup and shutdown speed an
issue.  Many  interesting  database  instances never approach this
size.

Is there a query optimizer?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes,  Gadfly automatically optimizes queries using some relatively
sophisticated  methods  that  attempt  to  take  advantage  of any
available   indices,   among   other   methods.   Technically  all
optimization  uses  only  "equality  predicates"  at this time, so
queries  involving  many  inequalities (eg, BETWEEN) and few or no
equalities  might be slow, or might create very large intermediate
results.   Most   standard   database  applications  use  equality
constraints extensively (eg, as foriegn keys or relationships) but
if  your  application  is,  say,  a  scientific  application  that
requires  many inequality comparisons across multiple tables among
floating point data, the Gadfly optimizer may not help much -- you
may  want  to  load  the  data  into  your own structures and hand
optimize the access and combination methods.

On  the  other  hand, if your application uses equality predicates
often  enough  you  may find that the Gadfly optimizer does pretty
well, perhaps better than a "hand coded home grown database" would
do without a lot of work.

Is it fast?
~~~~~~~~~~~~~
You  probably  want  a benchmark comparison. I have none to offer.
The  query evaluation and load/store facilities for Gadfly seem to
be  pretty fast for non-huge databases -- particularly if you make
intelligent  use of indices. At this point the slowest part is the
SQL  parser  itself.  If  you  use a lot of similar SQL statements
please  consult  the  documentation  on  how  to  avoid  reparsing
statements by use of multiple cursors and dynamic parameters.

Unverified  anecdotal reports suggest that Gadfly is not noticably
slower  than other free or freeish portable database systems; some
have  suggested  it can even be faster, at least for certain types
of use.

The  entire query engine is designed to use the kjbuckets Python C
extension   module,   and   use  of  kjbuckets  should  speed  the
performance  of Gadfly considerably -- Gadfly will run without it,
but noticably slower (2x slower for even small databases, more for
larger ones).

Are there data size limitations?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are no intrinsic limitations. For example a "varchar" can be
of  any  size.  That said, remember that a gadfly database, in the
absense  of  hackery  (eg,  storing a filename instead of a value)
stores  all data in memory, so at some point large values may fill
up virtual memory and the load/store operations can get slowish.

What about recovery after crashes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  supports  a  log with deferred updates recovery procedure,
together  with  commits  and  rollbacks.  This  means that if your
program  crashes  or if your computer crashes it should be able to
restore  a  Gadfly  database  to  the state of the database at the
point  of  the  last "commit". Any remaining bugs in this area are
likely  to have to do with data definition statements ("create" or
"drop") so be careful to commit and checkpoint after modifying the
data  definitions.  Of  course, there are no known bugs, but it is
possibly that not all possible combinations have been tested (this
area  is quite subtle, unfortunately :( ). Please see the recovery
documentation for more information.

What about client/server based access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly supports a TCP/IP based client/server mode, together with a
non-standard but highly flexible security paradigm for the server.
Please see the client/server mode documentation.

What about concurrency control?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  does  not have true concurrency control. The only safe way
to  allow  two concurrent agents to modify a single database is to
have  both  of  them  communicate  with  the database via a gadfly
server  that  arbitrates  database accesses. The server will serve
each  agent  serially -- one at a time -- in separate transactions
each committed immediately.

Of  course,  two  programs  can  have read-only access to the same
database in separate memory instances without any problems.

Is there a mailing list?
~~~~~~~~~~~~~~~~~~~~~~~~~~
Not yet - there may be once sufficient volunteers put their hands up
to help maintain the project.


Who uses it?
~~~~~~~~~~~~~~
My  email suggests that quite a few people have been using Gadfly,
including  professional  and  industrial  organizations of various
sorts.  From  my  vantage point it is not clear how serious all of
these  uses  have  been  -- but some of them appear quite serious.
Gadfly  has been included in a number of CD-rom publications, such
as Linux distributions.

Zope Corporation includes gadfly in Zope.


Is support available?
~~~~~~~~~~~~~~~~~~~~~~~
Not at present. There may be once sufficient volunteers put their hands up to
help maintain the project.


What are the plans for Gadfly?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
At  this  point  the  current  plan  is  to  maintain  the current
implementation  and  fix  any  bugs  that  arise. Some time in the
distant future (several months maybe) a new release with major new
features might arrive, but only if the new features allow existing
applications  to  run. Gadfly will support backwards compatibility
of existing databases.

How do I contribute some code that uses Gadfly?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Please contact the project maintainers.


Installation
------------

This  section  relates  to  the  installation of Gadfly and connecting
Gadfly to other software.

Does Gadfly support ODBC or JDBC?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Although  the  Gadfly  SQL  subset  is  based  on the ODBC 2.0 SQL
definition  Gadfly  does  not  support  the  C-level  ODBC calling
mechanism.  All  direct accesses to Gadfly must use Python modules
at this time.

How can I move a Gadfly database instance from one location/machine to another?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A  gadfly  database  instance as stored in a file system directory
may  be  copied to another directory even on another machine using
any  binary  file copy mechanism. You must copy all files relating
to  the  database instance in the gadfly database directory to the
destination,  and be sure that if you are copying across different
platforms to use a binary copy mechanism (eg, ftp in binary mode).

How can I access gadfly from another program?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The only supported API (applications program interface) for gadfly
at  this  writing  is  the Python Database API either using direct
access  or  via the client/server mechanism. Although you must use
Python  to access a gadfly database at this time it is possible to
embed  a Python instance in another program. Please see the Python
documentation.

What database size limits are there?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As  mentioned  there are no intrinsic limits (eg, a varchar can be
of  any  size)  except  for  the  limitations  of  memory  and the
possibility  that the load/store mechanism may get too slow if the
database grows too large.

What are the file types in the gadfly database directory?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- \*.grl -- a relation representation where * is the relation name.
- \*.brl -- a back-up relation representation (for possible recovery).
- \*.gfd -- a data definitions file where * is the database name.
- \*.bfd -- a back-up data definitions file (for possible recovery).
- \*.gfl -- a log file where * is the database name
- \*.glb -- a back-up log file (for possible recovery).

Use
---

The following section relates to the use of gadfly.

Does  Gadfly  support virtual tables (VIEWS)? Quantifiers? EXISTS?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Aggregates? Groupings? Indices?
Of  course  it  does! Gadfly supports a very large SQL subset. See
the additional documentation for more detailed information.

How standard is the SQL subset supported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gadfly  SQL  is based on ODBC 2.0 SQL which in turn is based on an
SQL  standard draft. This means that Gadfly SQL adheres closely to
the SQL you find in many other database products and documented in
many  books  and  other  documentation.  A  number  of people have
implemented  Gadfly  databases and transferred the SQL with few to
no  modifications  from  the  implementation directly to Oracle or
MS-Access, for example.

Are BLOBS (binary large objects) and arbitrary string values supported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes,  varchars  have  no  intrinsic  size limit and can be used to
store  marshalled  or  pickled Python BLOBS -- but the application
using  gadfly  will have to know when to "deserialize" the objects
(using  marshal  or pickle or other mechanisms). For example it is
even  possible  to  store  Python  byte  code  objects in a gadfly
database  in  this  way.  In particular gadfly stores all strings,
including  strings  with  null  bytes  and  newlines, but for such
strings  you must use dynamic parameters (or other mechanisms that
avoid embedding the string directly into the SQL syntax), like::

  s = chr(0)+chr(10)
  cursor.insert("insert into table x(v) values (?)", (s,))

NOT::

  s = chr(0)+chr(10)
  cursor.insert("insert into table x(v) values ('%s')" % (s,))

Since the SQL parser will choke on the latter.

How can I find the column names in a "select \*"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The  nasty  answer is "don't use select \*", but if you really want
to  you  can  get  the  order  of  names  for the columns from the
description attribute for a cursor::

    >>> from gadfly import gadfly
    >>> g = gadfly("test", "dbtest")
    >>> c = g.cursor()
    >>> c.execute("select * from work")
    >>> c.description
    (('RATE', None, None, None, None, None, None), ('NAME', None, None, None,
    None, None, None), ('HOURS', None, None, None, None, None, None))
    >>> print c.pp()
    RATE  | NAME    | HOURS
    =======================
    40.2  | sam     | 30
    10.2  | norm    | 45
    5.4   | woody   | 80
    4.4   | diane   | 3
    12.9  | rebecca | 120
    200.0 | cliff   | 26
    3.5   | carla   | 9

IE,  c.description[i][0]  gives  the  name  of the ith column. The
query mechanism essentially randomizes the order of the columns in
a  select * and you cannot rely on the engine producing columns in
any  specific  order  unless  you  specify  the order in the query
explicitly.

Can I add my own table implementation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes  you can, to a limited extent. Use the remote view protocol in
gfintrospect.py.  The  table  you add can extract information from
any  source  but  must  not  recursively  query  the same database
instance  (using  a  Python  implementation)  and  place it into a
Gadfly database either only once on first usage or once per query.
The  table  added  cannot  be  updated via SQL and you cannot load
"only  the  part  of the table you need for this query." The table
added   must   be   explicitly   re-added   during   the  database
initialization  on  each  usage,  and  if  you implement the table
incorrectly  you  may cause some gadfly queries to crash. Use with
caution: this is an advanced feature. See remotetest.py.

Why can't I update my own table implementation via SQL?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Although  it  is  possible  to  create a protocol which allows non
standard  table  implementations  to  be updated and optimized via
SQL,  the implications to the query optimization mechanism and the
crash  recovery strategy are not clear and may be quite subtle. At
this  time  the  implementation opts to avoid possible bugs by not
supporting  such  features,  although  programmers  are welcome to
experiment  at  their  own risk, with the understanding that their
experimental   modifications   may  not  be  supported  in  future
releases.

It is perfectly possible to change the number of rows or values in
no  modifications  from  the  implementations without using SQL or
gadfly,  however  and  have the updates automatically reflected in
the  table  instance  in  the  database. [You should not, however,
change the number or names of the columns.]

For  example  gfintrospect.DictKeyValueView  will  "wrap" a Python
dictionary   as   a   gadfly   table   and  automatically  reflect
modifications   made   by   an  external  Python  program  to  the
dictionary.

How do you define a primary key?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
New  in  1.0 you can get the effect of a primary key by defining a
unique index on the primary key columns::

    create unique index pkey on person(firstname, lastname)

Effectively  will enforce a primary key constraint for (firstname,
lastname) on the person table.

What about NULLs, Triggers and other missing stuff?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The   present  release  opted  not  to  add  missing  standard  or
non-standard   features   that   were   likely   to   cause  major
modifications   to  large  sections  of  the  implementation,  and
therefore were likely to introduce bugs.

Where is the LIKE predicate?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The  LIKE  predicate for string matching is still not supported at
the  SQL  level.  For  what it's worth, it is easy to use Python's
string matching (regex, re, string.search, etcetera) facilities on
the result of a query. Also, for what it's worth, since the gadfly
optimizer won't easily be able to optimize for string matching the
"by hand" method would essentially be what gadfly would do anyway,
without major modifications to the implementation.

After a crash Gadfly won't recover! Help!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This  shouldn't  happen,  but  did  happen  (at  least once) for a
previous  release.  If it happens again, you can explicitly delete
the  log files from the database directory in order to recover the
database  to  a state which may or may not correspond to the state
of the database at the second to last commit/checkpoint operation.
It  is  possible,  but  not  likely,  that the database state will
include  some  but  not  all updates from the last commit, but, to
repeat,  it  shouldn't  happen.  Please  report  the problem if it
occurs again.

