What is GOBY?
GOBY is a benchmark dataset designed for evaluating data integration techniques specifically for enterprise data.
It was derived from a real-world production workload in the event promotion and marketing domain, compiled around 2017.
Unlike public benchmarks, GOBY focuses on private datasets, making it more representative of enterprise challenges.
Where was it collected?
The data was collected over several years using more than 1,000 wrappers developed by professionals.
These wrappers converted web pages and APIs into relational tables, creating a rich dataset.
What does it represent?
GOBY comprises over 4 million rows of event data.
It includes detailed semantic labels such as event locations, organizers, and other metadata relevant to the domain.
It highlights the structural and semantic complexity often found in enterprise datasets.
What does it contain?
- Source Tables: Nearly 1,200 source tables, each generated from a wrapper.
- Semantic Types: A hierarchy of semantic types developed by domain experts.
- Universal Schema: A unified schema combining all source tables.
- Statistics:
  - 4.04 million rows
  - 23,203 columns
  - Average 3,405 rows per table
  - Average 20 columns per table
GOBY is semantically richer and structurally more complex than typical public benchmarks, such as VizNet or T2Dv2, making it well-suited for enterprise data integration tasks.
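As a rough illustration of how semantic types can tie source tables to a universal schema, the sketch below relabels each source column with a semantic type and stacks the resulting rows. The table contents, column names, and type labels (event_name, event_location, organizer) are hypothetical and are not taken from GOBY. This is one simple way to build a unified table, not necessarily how the benchmark itself was constructed.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Hypothetical source tables, each standing in for the output of one wrapper.
SOURCE_TABLES: Dict[str, List[Row]] = {
    "source_a": [{"title": "Jazz Night", "venue": "Blue Hall"}],
    "source_b": [{"name": "Tech Meetup", "place": "Hub 12", "host": "DevClub"}],
}

# Hypothetical semantic labels: (source, column) -> semantic type.
SEMANTIC_TYPES: Dict[Tuple[str, str], str] = {
    ("source_a", "title"): "event_name",
    ("source_a", "venue"): "event_location",
    ("source_b", "name"): "event_name",
    ("source_b", "place"): "event_location",
    ("source_b", "host"): "organizer",
}


def to_universal(
    tables: Dict[str, List[Row]],
    semantic_types: Dict[Tuple[str, str], str],
) -> List[Row]:
    """Rename labeled columns to their semantic type and stack all rows."""
    universal_columns = sorted(set(semantic_types.values()))
    unified: List[Row] = []
    for source, rows in tables.items():
        for row in rows:
            # Start with every universal column empty, then fill what we can.
            out: Row = {column: None for column in universal_columns}
            out["source"] = source
            for column, value in row.items():
                semantic_type = semantic_types.get((source, column))
                if semantic_type is not None:  # unlabeled columns are dropped
                    out[semantic_type] = value
            unified.append(out)
    return unified


unified_rows = to_universal(SOURCE_TABLES, SEMANTIC_TYPES)
for row in unified_rows:
    print(row)
```

Columns that a source does not provide stay empty (None) in the unified rows, which is why a universal schema over many heterogeneous sources tends to be wide and sparse.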
The primary data archive, goby.tar.gz, contains the following key directories:
- dump/: PostgreSQL dump files that include:
  - doit_categories: Data categories with record counts.
  - doit_data: Triple-based data representing (category_id, source_id, entity_id, name, value).
To access the GOBY dataset:
Download the goby.zip file with the button below, then extract it using the password GOBY2025 (replace your_password in the command with this password):
unzip -P your_password goby.zip -d /path/to/extract/
Your support in improving this dataset is greatly appreciated! If you have any questions or feedback, please send an email to Moe Kayali.
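For readers working with the doit_data dump described above, the following sketch shows one way to pivot rows of the form (category_id, source_id, entity_id, name, value) back into one record per entity. The sample rows and attribute names are invented for illustration, and grouping by (category_id, source_id, entity_id) is an assumption about how entities are keyed, not documented behavior.

```python
from collections import defaultdict

# Invented sample rows in (category_id, source_id, entity_id, name, value) order.
rows = [
    (1, 10, 100, "title", "Jazz Night"),
    (1, 10, 100, "venue", "Blue Hall"),
    (1, 10, 101, "title", "Open Mic"),
    (2, 11, 100, "title", "Tech Meetup"),
]


def pivot(rows):
    """Group name/value pairs into one dict per (category, source, entity) key."""
    records = defaultdict(dict)
    for category_id, source_id, entity_id, name, value in rows:
        # Assumed key: the three identifiers together pick out one entity.
        records[(category_id, source_id, entity_id)][name] = value
    return dict(records)


records = pivot(rows)
for key, attributes in records.items():
    print(key, attributes)
```

In this sketch a repeated name for the same entity overwrites the earlier value; collecting values in a list would be safer if attributes can be multi-valued.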