There are many common data exchange formats available, but not every format is suitable for data streaming. In this post, we'll explore popular formats and their limitations, and find out how JSONLines can be useful in data streaming.
CSV (Comma-Separated Values) is one such format, in which each row represents a record:
id,company_name,net_worth_in_billions
1,XYZ,6.58
2,FFS,2.11
3,IFO,6.66
This format has a major restriction: it can represent only two-dimensional data. To add nested data about companies, there are a couple of options:
- duplicate the parent row for each new child record, or
- create a separate CSV file with the child data in it, linked by the id column in the main CSV file. This requires operations like SQL's JOIN to process the data, as sketched below.
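As a rough sketch of the second option, a pandas merge can stand in for the SQL JOIN. The file names and the parent_id column are hypothetical:

import pandas as pd

# companies.csv holds the parent table shown above.
# child_companies.csv (hypothetical) links back via parent_id:
#   id,parent_id,name
#   6,1,Juarez
companies = pd.read_csv("companies.csv")
children = pd.read_csv("child_companies.csv")

# The equivalent of a SQL LEFT JOIN: attach child rows to their parents.
joined = companies.merge(
    children, left_on="id", right_on="parent_id",
    how="left", suffixes=("", "_child"),
)
print(joined)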
JSON is another format, one that solves the problem of nesting. We can represent the same information like this:
[
  {
    "id": 1,
    "company_name": "XYZ",
    "net_worth_in_billions": 6.58
  },
  {
    "id": 2,
    "company_name": "FFS",
    "net_worth_in_billions": 2.11
  },
  {
    "id": 3,
    "company_name": "IFO",
    "net_worth_in_billions": 6.66
  }
]
To add info about child companies, we can extend each item in the array with a child_companies property:
[
  {
    "id": 1,
    "company_name": "XYZ",
    "net_worth_in_billions": 6.58,
    "child_companies": [
      {
        "id": 6,
        "name": "Juarez"
      }
    ]
  },
  {
    "id": 2,
    "company_name": "FFS",
    "net_worth_in_billions": 2.11,
    "child_companies": []
  },
  {
    "id": 3,
    "company_name": "IFO",
    "net_worth_in_billions": 6.66,
    "child_companies": []
  }
]
With the ability to represent data of almost any shape, JSON is one of the most popular data transfer formats.
The problem arises when the data we need to send becomes very large and we want to process it as it arrives.
With JSON, we cannot start processing until the last bit of data has been transferred over the wire, because the document is not valid until its closing bracket arrives.
We have a better solution to this problem: JSONLines (also called Newline Delimited JSON, or NDJSON). In JSONLines, each line is a valid JSON object, and each line represents a new record. This makes it possible to seek to an arbitrary line and to process the data in chunks. In this format, you can process the data as you read it from a stream, without needing to read the complete file first. That is a huge deal if you are processing files that run into gigabytes or terabytes.
The format would look something like this:
{"id": 1, "company_name": "XYZ", "net_worth_in_billions": 6.58, "child_companies": [{"id": 6, "name": "Juarez"}]}
{"id": 2, "company_name": "FFS", "net_worth_in_billions": 2.11, "child_companies": []}
{"id": 3, "company_name": "IFO", "net_worth_in_billions": 6.66, "child_companies": []}
The parsing logic is simple: read one line, parse it, and then move on to the next one.
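As a minimal sketch (the companies.jsonl file name is an assumption), a line-by-line reader in Python could look like this:

import json

def read_jsonl(path):
    """Yield one parsed record at a time from a JSONLines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                    # skip blank lines
                yield json.loads(line)  # each line is a complete JSON object

# Process records as they are read, without loading the whole file.
for record in read_jsonl("companies.jsonl"):
    print(record["company_name"])

Because each record is parsed independently, memory usage stays flat no matter how large the file grows.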
JSONLines files use .jsonl or .jl as file extensions and the application/jsonlines MIME type. A few places use the application/jsonlines+json MIME type instead, to help web browsers render the content as JSON.
Tooling
Many tools and libraries already support JSONLines.
In jq, you can pipe in JSONLines data and get pretty-formatted JSON out.
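For example (again assuming the companies.jsonl file from above), jq treats the input as a stream of JSON values and pretty-prints each record:

jq '.' companies.jsonl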
In pandas, you can pass lines=True while reading a dataframe from a JSONLines file.
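A minimal sketch, again assuming companies.jsonl:

import pandas as pd

# lines=True tells read_json to parse one JSON object per line.
df = pd.read_json("companies.jsonl", lines=True)
print(df.head())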
Conclusion
We saw that while JSON is an excellent text-based data format with a good balance of readability and extensibility, it is a poor choice when it comes to stream processing of large datasets.
JSONLines is a close alternative to JSON, especially for huge amounts of data, that allows efficient storage, transfer, and processing.