Document Stores: MongoDB, CouchDB and others

The first NoSQL databases I want to take a closer look at are document stores. The most popular document store at the moment is MongoDB. In a document store, the data is stored in self-contained documents: related data is denormalized and embedded in the document instead of being split across several tables.

Example

{"country": "Switzerland", "population": 8236303,
 "languages":
    {"German": "63%", "French": "23%", "Italian": "8%", "Romansch": "0.5%"}
}

In a relational database management system (RDBMS), you would store the language information in a separate table with a foreign key reference to the country, while in a document store everything belonging to the country is contained in the document itself. This means you do not have to join the country and language tables to find out which languages are spoken in a country.
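To make the difference concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module for the relational side and a plain dict as a stand-in for a document; the table and column names are my own assumptions for illustration, not something a particular document store prescribes.

```python
import sqlite3

# Relational layout: languages live in their own table and point back
# to the country through a foreign key (hypothetical table names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT, population INTEGER);
    CREATE TABLE language (
        country_id INTEGER REFERENCES country(id),
        name TEXT,
        share TEXT
    );
    INSERT INTO country VALUES (1, 'Switzerland', 8236303);
    INSERT INTO language VALUES (1, 'German', '63%'), (1, 'French', '23%');
""")

# Reading the languages of a country requires a join.
rows = conn.execute("""
    SELECT language.name
    FROM country JOIN language ON language.country_id = country.id
    WHERE country.name = 'Switzerland'
    ORDER BY language.name
""").fetchall()
relational_languages = [name for (name,) in rows]

# Document layout: the languages are embedded in the country document,
# so a single lookup returns everything.
country_doc = {
    "country": "Switzerland",
    "population": 8236303,
    "languages": {"German": "63%", "French": "23%"},
}
document_languages = sorted(country_doc["languages"])
```

Both variants yield the same list of languages, but the document variant gets there without touching a second table.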

Another difference from an RDBMS is that document stores are schema-less. That means no structure is explicitly imposed on the documents, and documents in the same collection can have different structures.

Of course, the data is usually read by applications that need to rely on some attributes being available in order to work with it. That means there is an implicit schema, defined by the application using the document store.
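A small sketch of what such an implicit schema can look like in application code. The list of dicts stands in for a collection, and the describe function is a hypothetical helper I made up for this example.

```python
# Two documents in the same hypothetical "countries" collection.
# Nothing forces them to share the same attributes.
countries = [
    {"country": "Switzerland", "population": 8236303,
     "languages": {"German": "63%", "French": "23%"}},
    {"country": "Liechtenstein", "capital": "Vaduz"},
]


def describe(doc):
    """Format a document, relying on an implicit schema.

    The store never checks this; the application simply assumes that
    "country" exists and treats "languages" as optional.
    """
    name = doc["country"]                 # required by the application
    languages = doc.get("languages", {})  # optional, defaults to empty
    return f"{name}: {len(languages)} language(s) recorded"


summaries = [describe(doc) for doc in countries]
```

If a document without a "country" attribute ever ends up in the collection, describe fails with a KeyError at read time, which is exactly where the implicit schema shows itself.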

Not having a schema has some advantages:

During development, you do not need to alter or recreate tables when you decide to add or rename an attribute (of course, after renaming an attribute you still have to migrate your data, or let your code rewrite the document when it encounters the old attribute).
You can easily import data from any JSON web service into the document store and only need to reason about its structure when you access it. With an RDBMS, you would first have to identify the fields and decide whether the data needs to be distributed across different tables.
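Letting the code rewrite a document when it comes across an old attribute could look roughly like this. It is only a sketch: the dict named store stands in for a collection, and the old attribute name "langs" is an assumption for the sake of the example.

```python
def load_country(store, key):
    """Return the document for key, upgrading an old attribute name on read."""
    doc = store[key]
    if "langs" in doc:
        # Old layout found: move the value to the new attribute name
        # and write the upgraded document back to the store.
        doc["languages"] = doc.pop("langs")
        store[key] = doc
    return doc


# A document that was written before the attribute was renamed.
store = {"CH": {"country": "Switzerland", "langs": {"German": "63%"}}}
doc = load_country(store, "CH")
```

Documents that are never read keep the old attribute name, so with this approach the reading code has to handle both layouts for as long as old documents exist.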

On the other hand, the lack of an explicit schema has the disadvantage that you may get some hard-to-find bugs if you misspell an attribute when writing to a document. Nothing will prevent you from writing to the attribute “langauge” instead of “language”, as a relational database with a schema would.
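The following sketch shows the same typo in both worlds. SQLite (through Python's sqlite3 module) plays the role of the relational database and a plain dict stands in for a document; the table layout is assumed for illustration.

```python
import sqlite3

# Relational side: the schema knows only the column "language".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, language TEXT)")

try:
    conn.execute(
        "INSERT INTO country (name, langauge) VALUES ('Switzerland', 'German')"
    )
    rejected_by_schema = False
except sqlite3.OperationalError:
    # The database refuses the statement because the column does not exist.
    rejected_by_schema = True

# Document side: the misspelled attribute is accepted without complaint.
doc = {"country": "Switzerland"}
doc["langauge"] = "German"

# The bug only surfaces later, when a reader finds no "language" attribute.
language = doc.get("language")
```

The relational insert fails immediately, while the document quietly carries the misspelled attribute until some reader notices that "language" is missing.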

Another disadvantage of storing data in documents is that you potentially store redundant data and have to keep it up to date. For example, if you have a collection of people with the company they work at as an embedded attribute, you will store the data about the company once for every employee of the company. This can be avoided by storing only the ID of a company document, but following such references is less efficient in document stores and takes away their biggest advantage, which is that the documents are self-contained.
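Here is a sketch of that trade-off with plain Python data structures standing in for collections. All names and values (Anna, Ben, Acme, the company_id attribute) are made up for the example.

```python
# Embedded: the company data is copied into every person document.
people_embedded = [
    {"name": "Anna", "company": {"name": "Acme", "city": "Bern"}},
    {"name": "Ben", "company": {"name": "Acme", "city": "Bern"}},
]


def move_company_embedded(people, company_name, new_city):
    """Update every embedded copy of the company; return the number of writes."""
    updated = 0
    for person in people:
        if person["company"]["name"] == company_name:
            person["company"]["city"] = new_city
            updated += 1
    return updated


# Referenced: the company is stored once and people only hold its ID.
companies = {"acme": {"name": "Acme", "city": "Bern"}}
people_referenced = [
    {"name": "Anna", "company_id": "acme"},
    {"name": "Ben", "company_id": "acme"},
]


def city_of(person):
    """Resolve the reference; this costs a second lookup on every read."""
    return companies[person["company_id"]]["city"]


# The company moves: one write per employee versus a single write.
embedded_writes = move_company_embedded(people_embedded, "Acme", "Zurich")
companies["acme"]["city"] = "Zurich"
```

With embedding, the number of writes grows with the number of employees; with references, the write is cheap but every read of the city needs the extra lookup in city_of.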

Whether it is better to embed an attribute in the document or to store it as a reference to another document depends on factors like how often the attribute is read and written. If the attribute changes often and is contained in many documents, then a reference may be the better choice. If it is read very frequently and seldom changes, it will be better to embed it. It also depends on the kind of relation between document and attribute. For one-to-one relationships, embedding is almost always the better choice. For one-to-many or many-to-one relationships, embedding is a good default, but it depends on whether the relation has to be accessed in both directions (e.g. you need to access all employees of a company, the employers of a person, or both). Relations that are accessed in both directions are usually better stored as references, relations accessed in one direction as embedded attributes. For many-to-many relationships, it depends on the frequency of change and access mentioned above.
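For a relation that has to be accessed in both directions, references on both sides could be sketched like this. Again, dicts stand in for collections, and the attribute names company_id and employee_ids are assumptions of mine.

```python
# Each side of the relation holds only the IDs of the other side.
people = {
    "anna": {"name": "Anna", "company_id": "acme"},
    "ben": {"name": "Ben", "company_id": "acme"},
}
companies = {
    "acme": {"name": "Acme", "employee_ids": ["anna", "ben"]},
}


def employer_of(person_id):
    """Person -> company: follow the reference stored in the person."""
    return companies[people[person_id]["company_id"]]["name"]


def employees_of(company_id):
    """Company -> people: follow the references stored in the company."""
    return [people[pid]["name"] for pid in companies[company_id]["employee_ids"]]
```

Both directions are direct lookups, but the application is now responsible for keeping company_id and employee_ids consistent with each other.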

In my next articles, I’ll take a closer look at MongoDB, especially at how to query it and how to call it from JVM languages like Java and Kotlin. I also plan to take a closer look at CouchDB and compare these two document stores.