What is GOBY?
GOBY is a benchmark dataset designed for evaluating data integration techniques specifically for
enterprise data.
It was derived from a real-world production workload in the event promotion and marketing domain, compiled
around 2017.
Unlike public benchmarks, GOBY focuses on private datasets, making it more representative of enterprise
challenges.
Where was it collected?
The data was collected over several years using more than 1,000 wrappers developed by professionals.
These wrappers converted web pages and APIs into relational tables, creating a rich dataset.
What does it represent?
GOBY represents over 4 million rows of data corresponding to events.
It includes detailed semantic labels such as event locations, organizers, and other metadata relevant to
the domain.
It highlights the structural and semantic complexity often found in enterprise datasets.
What does it contain?
- Source Tables: Nearly 1,200 source tables, each generated from a wrapper.
- Semantic Types: A hierarchy of semantic types developed by domain experts.
- Universal Schema: A unified schema combining all source tables.
- Statistics:
  - 4.04 million rows
  - 23,203 columns
  - Average 3,405 rows per table
  - Average 20 columns per table
GOBY is semantically richer and structurally more complex than typical public benchmarks, such as VizNet or T2Dv2, making it well-suited for enterprise data integration tasks.
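The statistics listed above can be cross-checked against each other with a little arithmetic. The following is a minimal sketch, assuming the averages are taken over all source tables; the variable names are illustrative only, and because the row total is rounded in the text, the results are approximate.

```python
# Figures quoted in the statistics above.
total_rows = 4_040_000        # "4.04 million rows" (rounded in the source)
total_columns = 23_203
avg_rows_per_table = 3_405

# Implied number of source tables; approximate because total_rows is rounded.
implied_tables = total_rows / avg_rows_per_table

# Implied average width of a table, given that table count.
implied_avg_columns = total_columns / implied_tables

print(f"implied tables: {implied_tables:.0f}")
print(f"implied average columns per table: {implied_avg_columns:.1f}")
```

The implied table count comes out just under 1,200, in line with the "nearly 1,200 source tables" figure, and the implied width is close to the stated average of 20 columns per table.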
The primary data archive, goby.tar.gz, contains the following key directories:
- dump/: PostgreSQL dump files that include:
  - doit_categories: Data categories with record counts.
  - doit_data: Triple-based data representing (category_id, source_id, entity_id, name, value).
To access the GOBY dataset:
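Download the goby.zip file with the button below, then extract it using the password GOBY2025:
    unzip -P your_password goby.zip -d /path/to/extract/
Replace your_password with the password above and /path/to/extract/ with the target directory.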
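Once the dump is loaded, the doit_data layout (category_id, source_id, entity_id, name, value) stores one attribute per row, so it usually needs to be pivoted before it looks like a source table. The sketch below shows one way this might be done in plain Python. It assumes that source_id identifies a source table, entity_id identifies a row within it, and name is the column name; these interpretations are not confirmed by the dataset description. The sample rows and the helper names pivot_by_source and to_relational are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows in the (category_id, source_id, entity_id, name, value)
# layout described for doit_data; the values are invented for illustration.
SAMPLE_ROWS = [
    (1, 10, 100, "title", "Jazz Night"),
    (1, 10, 100, "venue", "Blue Hall"),
    (1, 10, 101, "title", "Food Fair"),
    (1, 11, 200, "title", "Tech Meetup"),
    (1, 11, 200, "organizer", "Dev Club"),
]


def pivot_by_source(rows):
    """Group attribute rows into {source_id: {entity_id: {name: value}}}.

    Assumption: source_id maps to one source table and entity_id to one row.
    If an entity repeats a name, the later value overwrites the earlier one.
    """
    tables = defaultdict(lambda: defaultdict(dict))
    for _category_id, source_id, entity_id, name, value in rows:
        tables[source_id][entity_id][name] = value
    return {src: dict(entities) for src, entities in tables.items()}


def to_relational(entities):
    """Turn {entity_id: {name: value}} into (columns, rows).

    Attributes missing for an entity are filled with None.
    """
    columns = sorted({name for attrs in entities.values() for name in attrs})
    rows = [
        [entity_id] + [attrs.get(col) for col in columns]
        for entity_id, attrs in sorted(entities.items())
    ]
    return ["entity_id"] + columns, rows


tables = pivot_by_source(SAMPLE_ROWS)
columns, rows = to_relational(tables[10])
print(columns)
for row in rows:
    print(row)
```

In practice the rows would come from a query against the restored PostgreSQL database instead of an in-memory list, and a DataFrame library may be more convenient at the scale of the full dataset.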
If you use this dataset, cite our companion CIDR 2025 paper:
@inproceedings{cidr-goby,
author = {Moe Kayali and Fabian Wenz and Nesime Tatbul and Cagatay Demiralp},
title = {Mind the Data Gap: Bridging Large Language Models (LLMs) to Enterprise Data Integration},
booktitle = {15th Conference on Innovative Data Systems Research, {CIDR} 2025,
Amsterdam, The Netherlands, January 19-22, 2025},
publisher = {www.cidrdb.org},
year = {2025}
}
Your support in improving this dataset is greatly appreciated! If you have any questions or feedback, please send an email to Moe Kayali.