The Neglected Variety Dimension

The three classic dimensions of Big Data are volume, velocity and variety. While there has been much focus on addressing the volume and velocity dimensions, the variety dimension was rather neglected for some time (or tackled independently). By now, however, most use cases where large amounts of data are available in a single well-structured data format have already been exploited. The action is now where we have to aggregate and integrate large amounts of heterogeneous data from different sources – this is exactly the variety dimension. The Linked Data principles, with their emphasis on holistic identification, representation and linking, allow us to address the variety dimension. As a result, just as the Web gives us a vast global information system, the Linked Data principles let us build a vast global distributed data space (or efficiently integrate enterprise data). This is not only a vision: it has already started to happen and gains more and more traction, as can be seen with the schema.org initiative, Europeana or the International Data Spaces.

From Big Data to Cognitive Data

The three classic dimensions of Big Data are volume, velocity and variety. While there has been much focus on addressing the volume and velocity dimensions (e.g. with distributed data processing frameworks such as Hadoop, Spark and Flink), the variety dimension was rather neglected for some time. We have not only a variety of data formats – e.g. XML, CSV, JSON, relational data, graph data, … – but also data distributed across large value chains or across departments inside a company, under different governance regimes, data models, etc. Often the data is distributed across dozens, hundreds or in some use cases even thousands of information systems.

The following figure demonstrates that the breakthroughs in AI are mainly related to data: while the algorithms were devised early on and are relatively old, only once suitable (training) datasets became available were we able to exploit these AI algorithms.

Another important factor, of course, is computing power, which thanks to Moore's law allows us, every 4–5 years, to efficiently process data an order of magnitude larger than before.
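
A quick back-of-the-envelope check confirms this, assuming the commonly cited doubling period of roughly 18 months (the snippet is purely illustrative):

    # Compounded growth factor for a ~18-month doubling period,
    # a common reading of Moore's law.
    doubling_period_months = 18
    for years in (4, 5):
        factor = 2 ** (years * 12 / doubling_period_months)
        print(f"{years} years -> x{factor:.1f}")
    # 4 years -> x6.3
    # 5 years -> x10.1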

In order to deal with the variety dimension, we need a lingua franca for data mediation, which allows us to:

  1. uniquely identify small data elements without a central identifier authority. This sounds like a minor issue, but identifier clashes are probably the biggest challenge for data integration.
  2. map from and to a large variety of data models, since there are, and always will be, a vast number of different specialized data representation and storage mechanisms (relational, graph, XML, JSON and so forth) – see the sketch after this list.
  3. support distributed, modular data schema definition and incremental schema refinement. The power of agility and collaboration is by now widely acknowledged, but we need to apply it to data and schema creation and evolution as well.
  4. deal with schema and data in an integrated way, because what is a schema from one perspective turns out to be data from another (think of a car product model – it is an instance for the engineering department, but the schema for manufacturing).
  5. generate different perspectives on data, because data is often represented in a way suitable for one particular use case. If we want to exchange and aggregate data more widely, it needs to be represented more independently and flexibly, abstracting from any particular use case.
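
To make these requirements concrete, here is a minimal sketch in Python using the rdflib library (the data.example.org namespace, the JSON record and the property names are hypothetical) of how a record from one source model (JSON) can be mapped onto a generic graph representation, touching requirements 1, 2 and 4:

    import json
    from rdflib import Graph, Literal, Namespace, RDF

    # A namespace under a domain we control -- no central identifier
    # authority is needed to coin globally unique identifiers (req. 1).
    EX = Namespace("https://data.example.org/")

    # A record in one of many possible source formats (req. 2).
    record = json.loads('{"id": "p42", "name": "Model X Engine", "weight_kg": 310}')

    g = Graph()
    subject = EX[record["id"]]
    g.add((subject, RDF.type, EX.Product))                # a schema-level statement ...
    g.add((subject, EX.name, Literal(record["name"])))    # ... and instance data,
    g.add((subject, EX.weight_kg, Literal(record["weight_kg"])))  # in one model (req. 4)

    print(g.serialize(format="turtle"))

Because schema terms (such as EX.Product) and instance data live in the same triple model, the same mechanism covers both sides of the schema/data duality described above.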

The Linked Data principles (coined by Tim Berners-Lee) allow us to deal with exactly these requirements:

  1. Use Uniform Resource Identifiers (URIs) to identify the "things" in your data – URIs are almost the same as the URLs we use to identify and locate Web pages, and they allow us to retrieve and link data in the global Web information space. We also do not need a central authority for coining the identifiers: everyone can create their own URIs simply by using a domain name or Web space under their control as a prefix. "Things" refers here to any physical entity or abstract concept (e.g. products, organizations, locations and their properties/attributes).
  2. Use http:// URIs so people (and machines) can look them up on the Web (or an intranet/extranet) – an important aspect is that we can use the identifiers also to retrieve information about the things they identify. A nice side effect of this is that we can actually verify the provenance of information by retrieving it from its original location. This helps to establish trust in the distributed global data space.
  3. When a URI is looked up, return a description of the thing in the W3C Resource Description Framework (RDF) – just as HTML gives us a unified information representation technique on the Web, we need a similar mechanism for data. RDF is relatively simple, allows us to represent data in a semantic way, and can mediate between many other data models (I will explain this in next week's article).
  4. Include links to related things – just as we can link between web pages located on different servers or even at different ends of the world, we can reuse and link to data items. This is a crucial aspect for reusing data and definitions instead of recreating them over and over again, and thus for establishing a culture of data collaboration (see the sketch after this list).
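
The following minimal sketch shows the four principles in combination (Python with rdflib; the data.example.org namespace, the TIB description and the DBpedia resource are illustrative assumptions, not prescribed by the principles):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, RDF, RDFS

    # http:// URIs coined under a domain we control (principles 1 and 2)
    EX = Namespace("http://data.example.org/")

    g = Graph()
    org = EX["organization/tib"]
    g.add((org, RDF.type, FOAF.Organization))   # an RDF description (principle 3)
    g.add((org, FOAF.name, Literal("TIB")))
    # a link to a related thing hosted on another server (principle 4)
    g.add((org, RDFS.seeAlso,
           URIRef("http://dbpedia.org/resource/German_National_Library_of_Science_and_Technology")))

    print(g.serialize(format="turtle"))

Since the identifiers are dereferenceable http:// URIs, a client can also look them up directly, e.g. Graph().parse("http://dbpedia.org/resource/Leipzig") retrieves the RDF description published at that address (network access required).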

As a result, just as the Web gives us a vast global information system, these principles let us build a vast global distributed data management system in which we can represent and link data across different information systems. This is not just a vision – it has already started to happen. Some large-scale examples include:

  1. the schema.org initiative of the major search engines and Web commerce companies, which has defined a vast vocabulary for structuring data on the Web (already used on a large and growing share of Web pages) and uses GitHub for collaboration on the vocabulary – a small JSON-LD sketch follows this list;
  2. initiatives in the cultural heritage domain such as Europeana, where many thousands of memory organizations (libraries, archives, museums) integrate and link data describing their artifacts;
  3. the International Data Spaces initiative, aiming to facilitate distributed data exchange in enterprise value networks and thus establish data sovereignty for enterprises.
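
To illustrate the schema.org example: such vocabulary terms are typically embedded in web pages as JSON-LD, which maps directly onto RDF. A minimal sketch in Python with rdflib follows (the product and its values are made up; an inline @context is used so no network access is needed):

    from rdflib import Graph

    # A schema.org product description as it might be embedded in a web page.
    jsonld = """
    {
      "@context": {"@vocab": "https://schema.org/"},
      "@type": "Product",
      "name": "Acme Anvil",
      "offers": {"@type": "Offer", "price": "49.99", "priceCurrency": "EUR"}
    }
    """

    g = Graph()
    g.parse(data=jsonld, format="json-ld")  # JSON-LD support is built into rdflib >= 6.0
    print(g.serialize(format="turtle"))     # the same facts, now as RDF triples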

Text by: Prof. Dr. Sören Auer, Technische Informationsbibliothek, L3S Research Center

Sören Auer's blog on tib.eu: Link

Sören Auer's articles on LinkedIn: Link