Databases and Information Systems

Centralized data in the Mitarbeiter-Cluster

Dear all,

We had the problem of multiple, duplicated datasets within HDFS on the Mitarbeiter-Cluster. Duplicates waste space and make it unclear which copy is the current one.

Therefore, I created a unified data folder inside HDFS (/user/allDatasets). I moved all the data I could find in HDFS into that folder and deleted obvious duplicates. It contains three subfolders:

/user/allDatasets/original: this is where datasets downloaded from the web should be placed. Examples are DBpedia or any other LoD dataset as serialized triples (nt, ttl, n3), csv files, txt files, or data generated by benchmarks (synthetic data).

/user/allDatasets/parquet: for datasets stored in Parquet format which do not appear as a Hive database.

/user/allDatasets/results: while running a job you might want to generate some results. These are stored in files whose names look like "part-00000".
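As a quick illustration of what lands in results/, the per-partition part-files of a job can be listed and merged with the standard hdfs dfs client. This is only a sketch; the job name "myJob" below is made up:

```shell
# Sketch, assuming the standard `hdfs dfs` client; `myJob` is a hypothetical job name.
OUT=/user/allDatasets/results/myJob
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls "$OUT"                      # lists part-00000, part-00001, ...
    hdfs dfs -getmerge "$OUT" ./myJob.txt    # concatenate all part-files into one local file
else
    echo "hdfs client not found; would inspect $OUT"
fi
```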

You can take a look at the files in there and delete something if 1) you are sure it is no longer required and 2) you have the rights to do so. Otherwise simply let me know what to delete. Entry point: http://dbisma01.informatik.privat:8888/filebrowser/view=/user/allDatasets#/user/allDatasets

Whenever you need to put data into HDFS, please first check whether it is already inside "original" or, even better, inside "parquet".
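A quick way to follow this rule on the command line is to test for the dataset before uploading. A minimal sketch, assuming the `hdfs dfs` client; the dataset name "dbpedia" is just an example placeholder:

```shell
NAME=dbpedia    # hypothetical dataset name
FOUND=""
for d in /user/allDatasets/original /user/allDatasets/parquet; do
    # `hdfs dfs -test -d` exits with 0 if the directory exists
    if command -v hdfs >/dev/null 2>&1 && hdfs dfs -test -d "$d/$NAME"; then
        FOUND="$d/$NAME"
        break
    fi
done
if [ -n "$FOUND" ]; then
    echo "already present: $FOUND"
else
    echo "not found; upload with: hdfs dfs -put ./$NAME /user/allDatasets/original/$NAME"
fi
```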

In addition to this folder there is also the Hive metastore at /user/hive/warehouse (http://dbisma01.informatik.privat:8888/filebrowser/view=/user/allDatasets#/user/hive/warehouse). It also contains a large number of databases; there I only deleted the empty ones. Again, if you are sure something is not needed anymore, you can simply delete it or let me know.
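To see which warehouse databases actually take up space before deciding what to delete, the per-database totals can be summed up; a sketch, assuming the `hdfs dfs` client and the usual layout where each Hive database sits in a *.db directory under the warehouse path:

```shell
WAREHOUSE=/user/hive/warehouse
if command -v hdfs >/dev/null 2>&1; then
    # one human-readable total per database directory
    hdfs dfs -du -s -h "$WAREHOUSE"/*.db
else
    echo "hdfs client not found; would run: hdfs dfs -du -s -h $WAREHOUSE/*.db"
fi
```

If a database turns out to be unused, it should be dropped through Hive (e.g. `DROP DATABASE IF EXISTS some_db;`, with `some_db` a placeholder) rather than by deleting the directory, so the metastore stays consistent.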