Data represents a variety of useful information that often needs to be stored, sorted, categorized and analyzed to inform decision-making. Data is organized in data structures which represent the data as entities with attributes or characteristics.
Data can be classified as structured, semi-structured or unstructured.
Structured Data
Structured data has a fixed schema: every record has the same fields, and each field has the same data type across all records. The schema for structured data is usually tabular, with columns for the fields and a row for each entity. Structured data is often stored in databases with multiple tables that reference each other through key values in a relational model.
| ID | Name      | Surname  | Email                           |
|----|-----------|----------|---------------------------------|
| 1  | Naiomi    | Naidoo   | Naiomi.Naidoo@technology.online |
| 2  | Firstname | Lastname | Firstname@yahoo.com             |
Structured data in a table
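The fixed schema and the key-based references between tables can be sketched with Python's built-in `sqlite3` module. The `customers` columns mirror the table above; the `orders` table, its columns and the "Laptop" row are hypothetical, added only to show one table referencing another by key.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Fixed schema: every row has the same fields with the same types.
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        surname TEXT NOT NULL,
        email   TEXT NOT NULL
    )
    """
)

# Hypothetical second table that references customers by key value.
conn.execute(
    """
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL
    )
    """
)

conn.executemany(
    "INSERT INTO customers (id, name, surname, email) VALUES (?, ?, ?, ?)",
    [
        (1, "Naiomi", "Naidoo", "Naiomi.Naidoo@technology.online"),
        (2, "Firstname", "Lastname", "Firstname@yahoo.com"),
    ],
)
conn.execute("INSERT INTO orders (customer_id, item) VALUES (?, ?)", (1, "Laptop"))

# Join the two tables through the key relationship.
rows = conn.execute(
    """
    SELECT c.name, c.surname, o.item
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    """
).fetchall()
print(rows)
```

Because the schema is fixed, a row that omits a required column (for example, a customer without an email) is rejected at insert time rather than stored in a different shape.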
Semi-structured data
Semi-structured data is information that has some structure, but the fields can vary between entity instances.
Scenario: Some customers may have one email address, while others may have multiple email addresses or no email address at all.
JavaScript Object Notation (JSON) is a common data format for representing semi-structured data because of its flexible nature.
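The email scenario can be sketched in JSON as follows. The field names, the second address for customer 2 and the whole of customer 3 are hypothetical, chosen only to show entities of the same kind carrying different fields.

```python
import json

# Three customer documents with different shapes for the "emails" field:
# one address, two addresses, and no "emails" field at all.
customers_json = """
[
  {"id": 1, "name": "Naiomi", "surname": "Naidoo",
   "emails": ["Naiomi.Naidoo@technology.online"]},
  {"id": 2, "name": "Firstname", "surname": "Lastname",
   "emails": ["Firstname@yahoo.com", "firstname.lastname@example.com"]},
  {"id": 3, "name": "Example", "surname": "Customer"}
]
"""

customers = json.loads(customers_json)


def email_count(customer: dict) -> int:
    """Return how many email addresses a customer has (0 if the field is absent)."""
    return len(customer.get("emails", []))


counts = {c["id"]: email_count(c) for c in customers}
print(counts)  # {1: 1, 2: 2, 3: 0}
```

Code that reads semi-structured data has to tolerate missing or repeated fields, which is why the helper uses `dict.get` with a default instead of indexing directly.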
Cosmos DB is a distributed database engine with core features provided for any type of implementation model.
Features of Cosmos DB
Turnkey global distribution
Cosmos DB enables global data distribution and availability as a configuration setting in the portal, via the command line or via an ARM template, making data replication to a new location in a chosen region as seamless as possible. Both manual and automatic failover are supported, as well as multi-read and multi-write across primary and replica databases.
Elastic storage and throughput
Cosmos DB automatically scales database storage and throughput in a pay-for-consumption model. There is no need to pre-provision resources to account for future growth. Cosmos DB measures throughput in a standardized unit called the Request Unit (RU), which can be considered an abstraction of physical resources. RUs are provisioned per second, e.g. 2000 RU/s.
Throughput is provisioned at a database or container level.
| Container Level     | Database Level              |
|---------------------|-----------------------------|
| Isolated throughput | Containers share throughput |
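To make the Request Unit budget concrete, here is a rough arithmetic sketch using the 2000 RU/s figure from the text. The per-operation costs and the workload mix are assumptions for illustration only; actual charges depend on item size, indexing and query shape, and are reported by the service for each request.

```python
PROVISIONED_RU_PER_S = 2000  # the example figure from the text

# Assumed RU cost per operation, for illustration only.
ASSUMED_COST_PER_OP = {"point_read": 1.0, "write": 5.0, "query": 10.0}


def max_ops_per_second(provisioned_ru_per_s: float, ru_per_operation: float) -> float:
    """Upper bound on operations per second if every operation costs the same."""
    if ru_per_operation <= 0:
        raise ValueError("ru_per_operation must be positive")
    return provisioned_ru_per_s / ru_per_operation


def required_ru_per_s(ops_per_second: dict, cost_per_op: dict) -> float:
    """Total RU/s consumed by a mixed workload."""
    return sum(rate * cost_per_op[name] for name, rate in ops_per_second.items())


# How many operations of each kind the budget would cover on its own.
budget = {
    name: max_ops_per_second(PROVISIONED_RU_PER_S, cost)
    for name, cost in ASSUMED_COST_PER_OP.items()
}
print(budget)  # {'point_read': 2000.0, 'write': 400.0, 'query': 200.0}

# A hypothetical mixed workload, in operations per second.
workload = {"point_read": 500, "write": 100, "query": 50}
needed = required_ru_per_s(workload, ASSUMED_COST_PER_OP)
print(needed, needed <= PROVISIONED_RU_PER_S)  # 1500.0 True
```

The same arithmetic applies whether the throughput is dedicated to one container or shared across the containers of a database; in the shared case, the workload is the sum over all containers.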
Low latency
Microsoft’s financially backed SLA covers latency for read and write requests: less than 10 ms, 99% of the time.
Flexible consistency model
Data replication can be tuned with five consistency models on a sliding scale, so that the database can be optimized for a specific workload. Consistency can be configured globally or per connection.
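The five levels on that sliding scale are Strong, Bounded staleness, Session, Consistent prefix and Eventual. The sketch below models them as an ordered enum. The `effective_level` helper is hypothetical and not part of any SDK; it encodes the assumption that a connection may relax the globally configured default but not strengthen it.

```python
from enum import IntEnum
from typing import Optional


class ConsistencyLevel(IntEnum):
    """The five consistency levels, ordered from weakest (1) to strongest (5)."""

    EVENTUAL = 1
    CONSISTENT_PREFIX = 2
    SESSION = 3
    BOUNDED_STALENESS = 4
    STRONG = 5


def effective_level(
    account_default: ConsistencyLevel,
    requested: Optional[ConsistencyLevel] = None,
) -> ConsistencyLevel:
    """Pick the level a connection would use (hypothetical helper).

    Assumption: an override may only relax the default, never strengthen it.
    """
    if requested is None:
        return account_default
    if requested > account_default:
        raise ValueError(
            f"{requested.name} is stronger than the default {account_default.name}"
        )
    return requested


# List the levels from strongest to weakest.
levels = [level.name for level in sorted(ConsistencyLevel, reverse=True)]
print(levels)
# ['STRONG', 'BOUNDED_STALENESS', 'SESSION', 'CONSISTENT_PREFIX', 'EVENTUAL']

print(effective_level(ConsistencyLevel.SESSION, ConsistencyLevel.EVENTUAL).name)
# EVENTUAL
```

Stronger levels give more predictable reads at the cost of latency and availability; weaker levels trade those guarantees for speed.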
A unified security model exists across all APIs, providing built-in encryption at rest and in-transit. IP-based access control is supported.
To connect to a Cosmos DB account, two pairs of keys (read-write and read-only) are used. The service manages these keys to control access to the account and its data.
APIs
Cosmos DB exposes data through a variety of models and APIs. When you request data using a specific API, Cosmos DB will automatically handle the translation of data from the underlying data format to the data model required for the API.
| API           | Description                                                                                                                         |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|
| SQL API       | Core API with many unique features. Supports JavaScript logic and SQL queries.                                                      |
| MongoDB API   | Compatible with MongoDB v3.2 protocol. Supports aggregation pipeline.                                                               |
| Gremlin API   | Compatible with the Apache TinkerPop graph traversal language (Gremlin). Returns results in GraphSON (extended JSON) format.        |
| Table API     | Service-level compatibility with Azure Storage Tables. Migrate applications with no code changes.                                   |
| Cassandra API | Supports Cassandra Query Language (CQL) v4 protocol. Works out of the box with CQL shell.                                           |
| etcd API      | Implements etcd wire protocol. Can be used as a backing store for Azure Kubernetes Service.                                         |
Resource Model
Data in Azure Cosmos DB is stored in a hierarchy of resources.
Indexing
Cosmos DB automatically indexes all fields within all items or documents by default. While indexing can be useful for many workloads, indexing all fields and items can have a performance impact on more complex data sets.
Indexing can be controlled and tuned to balance the trade-off between write performance and query performance.
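As a hedged illustration of this tuning, the sketch below builds an indexing policy document that keeps automatic indexing on but excludes one property from the index. The property name `description` is hypothetical; excluding a large field that is never queried is one way to reduce write cost.

```python
import json

# Indexing policy as a plain dictionary, serialized to JSON below.
indexing_policy = {
    "indexingMode": "consistent",  # update the index as items are written
    "automatic": True,             # index items without an explicit opt-in
    "includedPaths": [
        {"path": "/*"}             # index every property by default ...
    ],
    "excludedPaths": [
        {"path": "/description/*"}  # ... except this hypothetical property
    ],
}

policy_json = json.dumps(indexing_policy, indent=2)
print(policy_json)
```

Fewer indexed paths mean less work on each write, while queries that filter on an excluded path become more expensive, so the choice depends on the workload.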
Index policies can be created to configure indexes by specifying the following: