Data formats

Data represents a variety of useful information that often needs to be stored, sorted, categorized and analyzed to inform decision-making. Data is organized in data structures which represent the data as entities with attributes or characteristics.

Data can be classified as structured, semi-structured or unstructured.

Structured data

Structured data has a fixed schema: every record shares the same fields, and each field has a fixed data type. The schema for structured data is usually tabular, with columns for the fields and a row for each entity. Structured data is often stored in databases with multiple tables that reference each other through key values in a relational model.

ID  Name       Surname   Email
1   Naiomi     Naidoo    Naiomi.Naidoo@technology.online
2   Firstname  Lastname  Firstname@yahoo.com

Structured data in a table
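
To make the fixed schema concrete, the table above can be sketched with Python's built-in sqlite3 module. The table name customers, the column types and the NOT NULL constraints are assumptions made for illustration; they are not part of the original example.

```python
import sqlite3

# In-memory database; column names mirror the example table above.
# The table name and column types are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        surname TEXT NOT NULL,
        email   TEXT NOT NULL
    )
    """
)

rows = [
    (1, "Naiomi", "Naidoo", "Naiomi.Naidoo@technology.online"),
    (2, "Firstname", "Lastname", "Firstname@yahoo.com"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Every row has the same columns, so a query can rely on them being present.
surnames = [row[0] for row in conn.execute("SELECT surname FROM customers ORDER BY id")]
print(surnames)
```

Because the schema is enforced by the table definition, a row that is missing a required field is rejected at insert time rather than discovered later during analysis.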

Semi-structured data

Semi-structured data has some structure, but the fields can vary from one entity instance to the next.

Scenario: Some customers may have one email address, while others may have multiple email addresses or none at all.

JavaScript Object Notation (JSON) is a common data format for representing semi-structured data because of its flexible nature.

//Customer 1
{
  "id": "1",
  "name": "Naiomi",
  "surname": "Naidoo",
  "contact":
  {
    "email": "naiomi@naidoo.com",
    "phone": "+27121231234"
  }
}
//Customer 2
{
  "id": "2",
  "name": "Firstname",
  "surname": "Lastname",
  "contact":
  {
    "email": "firstname@yahoo.com",
    "phone": "+27987654321"
  },
  "location":
  {
    "city": "Sandton"
  } 
}
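
Code that reads semi-structured data has to tolerate fields that are missing from some records. The following is a minimal sketch of that idea using the two customer documents above; wrapping them in a JSON array and the helper name city_of are assumptions made for illustration.

```python
import json

# The two customer documents from the example, held as a JSON array
# (the array wrapper is an assumption so that both parse in one call).
raw = """
[
  {"id": "1", "name": "Naiomi", "surname": "Naidoo",
   "contact": {"email": "naiomi@naidoo.com", "phone": "+27121231234"}},
  {"id": "2", "name": "Firstname", "surname": "Lastname",
   "contact": {"email": "firstname@yahoo.com", "phone": "+27987654321"},
   "location": {"city": "Sandton"}}
]
"""
customers = json.loads(raw)


def city_of(customer):
    """Return the customer's city, or None when no location is recorded."""
    return customer.get("location", {}).get("city")


cities = {c["id"]: city_of(c) for c in customers}
print(cities)
```

Customer 1 has no location field, so the lookup falls back to None instead of raising an error.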

Unstructured data

Documents, images, audio, video and binary files can be considered unstructured data.

Types of unstructured data

Azure Cosmos DB

Azure Cosmos DB is a globally distributed, multi-model database service. Its core features are available regardless of which API or data model an implementation uses.

Features of Cosmos DB

  • Turnkey global distribution

Cosmos DB enables global data distribution and availability as a configuration setting in the portal, via the command line or through an ARM template, making data replication to a newly chosen region as seamless as possible. Both manual and automatic failover are supported, as are multi-region reads and multi-region writes across primary and replica databases.

  • Elastic storage and throughput

Cosmos DB automatically scales database storage and throughput in a consumption-based pricing model. There is no need to pre-provision resources to account for future growth. Cosmos DB measures throughput in a standardized unit called the Request Unit (RU), which can be considered an abstraction of physical resources. RUs are provisioned per second, e.g. 2000 RU/s.

Throughput is provisioned at a database or container level.

Container Level        Database Level
Isolated throughput    Containers share throughput
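
The arithmetic behind an RU/s budget can be sketched as follows. This is a simplified illustration, not how the service meters requests: the per-operation RU costs, the container names and both helper functions are hypothetical, and real costs depend on item size, indexing and query shape.

```python
def max_ops_per_second(provisioned_ru_per_s, ru_cost_per_op):
    """Upper bound on operations per second for a given RU/s budget."""
    return int(provisioned_ru_per_s // ru_cost_per_op)


def remaining_shared_ru(database_ru_per_s, consumed_by_container):
    """RU/s left over when containers share a database-level budget."""
    return database_ru_per_s - sum(consumed_by_container.values())


# Container-level (isolated) throughput: the whole budget belongs to one container.
# The costs of 1 RU and 5 RU per operation are assumed values for illustration.
cheap_ops = max_ops_per_second(2000, 1)
costly_ops = max_ops_per_second(2000, 5)

# Database-level (shared) throughput: heavy use by one container
# leaves less for the others. Container names are hypothetical.
left_over = remaining_shared_ru(2000, {"orders": 1500, "customers": 300})

print(cheap_ops, costly_ops, left_over)
```

The second helper shows the trade-off in the table above: shared throughput is simpler to budget, but one busy container can starve the rest, whereas container-level throughput is isolated.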
  • Low latency

Microsoft’s financially backed SLA guarantees read and write latencies of less than 10 ms at the 99th percentile.

  • Flexible consistency model

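Five consistency levels are available on a sliding scale (strong, bounded staleness, session, consistent prefix and eventual) so that data replication can be optimized for a specific workload. A default consistency level is configured globally for the account and can be overridden per connection.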

Credit: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
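
The sliding scale can be pictured as an ordered list running from the strongest guarantee to the weakest. The sketch below uses the five level names from the Microsoft documentation linked above. The Consistency enum and the effective_level helper are hypothetical, and the rule that a connection may only relax the account default is stated here as an assumption rather than taken from the text.

```python
from enum import IntEnum


class Consistency(IntEnum):
    # Ordered from strongest (0) to weakest (4).
    STRONG = 0
    BOUNDED_STALENESS = 1
    SESSION = 2
    CONSISTENT_PREFIX = 3
    EVENTUAL = 4


def effective_level(account_default, requested=None):
    """Pick the level a connection would use.

    Assumption: a connection may keep or weaken the account default,
    but may not ask for something stronger.
    """
    if requested is None:
        return account_default
    if requested < account_default:
        raise ValueError(
            f"{requested.name} is stronger than the account default {account_default.name}"
        )
    return requested


print(effective_level(Consistency.SESSION).name)
print(effective_level(Consistency.SESSION, Consistency.EVENTUAL).name)
```

Moving toward the weak end of the scale generally trades stricter read guarantees for lower latency and higher availability.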
  • Enterprise-grade security

A unified security model exists across all APIs, providing built-in encryption at rest and in transit. IP-based access control is supported.

To connect to a Cosmos DB account, two pairs of keys are used: a read-write pair and a read-only pair. The service manages these keys to control access to the account and its data.

APIs

Cosmos DB exposes data through a variety of models and APIs. When you request data using a specific API, Cosmos DB will automatically handle the translation of data from the underlying data format to the data model required for the API.

API            Description
SQL API        Core API with many unique features.
               Supports JavaScript logic and SQL queries.
MongoDB API    Compatible with MongoDB v3.2 protocol.
               Supports aggregation pipeline.
Gremlin API    Compatible with the Apache TinkerPop graph traversal language (Gremlin).
               Returns results in GraphSON (extended JSON) format.
Table API      Service-level compatibility with Azure Storage Tables.
               Migrate applications with no code changes.
Cassandra API  Supports Cassandra Query Language (CQL) v4 protocol.
               Works out of the box with CQL shell.
etcd API       Implements etcd wire protocol.
               Can be used as a backing store for Azure Kubernetes Service.

Resource Model

Data in Azure Cosmos DB is stored in a hierarchy of resources: an account contains databases, a database contains containers, and a container holds items.
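
The hierarchy can be sketched with a few plain Python classes, assuming the usual account, database, container and item levels. The class names, the example names (example-account, crm, customers) and the path format are illustrative only and are not part of any Cosmos DB SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    items: list = field(default_factory=list)  # each item is a JSON-like dict


@dataclass
class Database:
    name: str
    containers: dict = field(default_factory=dict)  # container name -> Container


@dataclass
class Account:
    name: str
    databases: dict = field(default_factory=dict)  # database name -> Database


# Build one branch of the hierarchy: account -> database -> container -> item.
account = Account("example-account")
db = account.databases.setdefault("crm", Database("crm"))
customers = db.containers.setdefault("customers", Container("customers"))
customers.items.append({"id": "1", "name": "Naiomi", "surname": "Naidoo"})

path = f"{account.name}/{db.name}/{customers.name}/{customers.items[0]['id']}"
print(path)
```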

Indexing

Cosmos DB automatically indexes all fields within all items or documents by default. While indexing can be useful for many workloads, indexing all fields and items can have a performance impact on more complex data sets.

Indexing can be controlled and tuned to balance the trade-off between write performance and query performance.

Index policies can be created to configure indexes by specifying the following:

  • List of paths to index
  • Different types of indexing to perform
  • List of paths to exclude
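
As a rough illustration of the list above, an index policy can be written as a JSON document. The sketch below builds one as a Python dictionary using the field names I recall from the Cosmos DB indexing policy format (indexingMode, automatic, includedPaths, excludedPaths); treat them as assumptions and check them against the official documentation before use. The property paths are hypothetical and reuse the customer example. The index types are omitted because their exact syntax is not covered here.

```python
import json

# Opt-in policy: exclude everything by default, then index only the
# paths that queries are expected to filter on (hypothetical paths).
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/surname/?"},
        {"path": "/contact/email/?"},
    ],
    "excludedPaths": [
        {"path": "/*"},
    ],
}

policy_json = json.dumps(indexing_policy, indent=2)
print(policy_json)
```

Indexing fewer paths reduces the work done on every write, at the cost of slower queries on the paths that were left out.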

Types of indexes

Range                              Hash                                      Spatial
Provides comparison functionality  Quick lookup for exact match information  Used for geographical information
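
The difference between the first two index types can be illustrated with ordinary Python data structures: a dictionary for exact-match lookups and a sorted list for comparisons. This is a conceptual sketch only and says nothing about how Cosmos DB implements its indexes; the sample values and function names are hypothetical. Spatial indexes are left out.

```python
import bisect

surnames = ["Naidoo", "Lastname", "Adams", "Zulu"]  # hypothetical values

# Hash-style index: value -> positions, good for exact matches.
hash_index = {}
for pos, value in enumerate(surnames):
    hash_index.setdefault(value, []).append(pos)

# Range-style index: (value, position) pairs kept in sorted order,
# good for comparisons such as >= or <.
range_index = sorted((value, pos) for pos, value in enumerate(surnames))


def exact_match(value):
    """Positions whose value equals the argument (hash lookup)."""
    return hash_index.get(value, [])


def greater_or_equal(lower_bound):
    """Positions whose value is >= lower_bound (binary search on sorted data)."""
    start = bisect.bisect_left(range_index, (lower_bound,))
    return [pos for _, pos in range_index[start:]]


print(exact_match("Naidoo"))
print(greater_or_equal("M"))
```

The hash structure answers equality questions in a single lookup but cannot answer "greater than" without scanning everything, which is the kind of query the sorted structure handles efficiently.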