Recommended articles:
-
Global Energy Interconnection
Volume 2, Issue 1, Feb 2019, Pages 78-84
Accurate querying of frequent subgraphs in power grid graph data
Abstract
With the development of information technology, the amount of power grid topology data has gradually increased.Therefore, accurate querying of this data has become particularly important.Several researchers have chosen different indexing methods in the filtering stage to obtain more optimized query results because currently there is no uniform and efficient indexing mechanism that achieves good query results.In the traditional algorithm, the hash table for index storage is prone to “collision” problems, which decrease the index construction efficiency.Aiming at the problem of quick index entry, based on the construction of frequent subgraph indexes, a method of serialized storage optimization based on multiple hash tables is proposed.This method mainly uses the exploration sequence to make the keywords evenly distributed; it avoids conflicts of the stored procedure and performs a quick search of the index.The proposed algorithm mainly adopts the “filterverify” mechanism; in the filtering stage, the index is first established offline, and then the frequent subgraphs are found using the “contains logic” rule to obtain the candidate set.Experimental results show that this method can reduce the time and scale of candidate set generation and improve query efficiency.
1 Introduction
In power information systems,power grid devices form a large network topology.With the rapid construction and in-depth development of the global energy internet,the topological structure and operation mode of the power grid are becoming increasingly complex.Power grid equipment presents a new trend of rapid growth,change,and diversification; the asset management level of power grid equipment continues to increase.Therefore,higher requirements are desired for the query and analysis service function of frequent subgraphs in similar topology analysis systems.
With continuous breakthroughs in new computer technologies such as cloud computing and big data,a series of non-relational,distributed NoSQL data storage systems have begun to replace relational databases in some applications to achieve better performance [1-4].A graph database is an important NoSQL data storage system.It organizes internally stored data into a logical model of attribute maps,and separates the relationships as in relational databases into attribute values of edges of attribute maps that are directly connected to entities; it replaces traditional relational operations with topological operations,thereby,greatly improving nested query execution efficiency.Currently,graph databases have attracted attention in the power system field [5-8].As wellknown,the complex network topology of a power grid is modeled and stored in a graph database.Full utilization of the graph computing capabilities of the graph database will be beneficial for the power grid information system.The graph database solidifies and stores relationships between entities; this is particularly suitable for flexible and efficient storage and management of relation-intensive power system data.Currently,there is no uniform and efficient indexing mechanism that can achieve good query results.Therefore,many researchers have chosen different indexing methods in the filtering stage to obtain more optimized query results [9].This study is based on the use of hash tables for index storage in traditional algorithms,which are prone to “conflict” problems and result in reduced index construction efficiency.Aiming at fast index entry,based on the construction of frequent sub-graph indexes,a method of serialized storage optimization based on multi-hash table is proposed.This method mainly uses the exploration sequence for evenly hashed keywords.Each position in the table avoids conflict in the stored procedure and implements a quick search for the index.Experiments on real data sets are performed to verify the effectiveness of the algorithm.
The remainder of the paper is organized as follows:section 2 describes related research all over the world.Section 3 analyzes the algorithm and flow of multi-hash coding.Section 4 gives the analysis of experimental results obtained.Section 5 concludes the paper.
2 Related work
In recent years,many methods have emerged for frequent subgraph queries in graph databases.Molnar et al.presented relational database and semi-structured data modeling based on conceptual graphs [10].It also discusses a conceptual graph-based query design for the relational data model and XQuery.The novelty of this work is an intuitive graphical web application,which represents the XML data structure in the form of conceptual graphs and leads to the possibility of constructing queries in form of conceptual graphs on the selected data structure.Zhang et al.introduced a novel algorithmic framework for pathway similarity search,called PathEmb(Pathway Embedding)[11]; PathEmb is a topology-free pathway similarity search algorithm,which is feasible for handling any pathway with an arbitrary structure.The experimental results demonstrate that PathEmb outperforms existing methods in terms of computational efficiency and search accuracy.Sun et al.[12]presented a fast top-k graph similarity search algorithm with high classification accuracy.It introduces a new graph similarity measure based on the number of occurrences of subtree patterns in graphs.It also can scale to large data sets including 15 million chemical structure graphs.Anna et al.[13]presented a method for parallel speculative query execution support to be applied to relational database systems.The method is based on the dynamic analysis of an input query stream in databases serviced in SQLite.A special representation of queries in the form of multigraphs is employed.They aimed at finding the best candidate queries for speculative execution,which fit currently accumulated query window.Ding et al.[14]proposed an index structure named Closure+-tree to process the subgraph query efficiently.During the processing of subgraph queries,all graph descendants will be pruned if their union does not contain the query graph.Experimental results revealed that their index structure can prune up to 80% unqualified graphs with variable size of queries.
3 Basic principle of multi-hash encoding
A hash table is a form of hash expression.Its function is to convert an input of any length into an output of a fixed length,called hash value,through a hashing algorithm.In other words,it can transform data of any length into data of a fixed length,which is changing indefinite length data to fixed-length data in the mapping process.Hash tables have the advantage of directly mapped transformations; however,they suffer from drawbacks.Different keywords may output the same hash address after applying the hash function,that is,This phenomenon is called address conflict,also referred to as conflict.Keywords with the same function value are synonymous to the hash function.Therefore,to obtain a hash address,a set of keywords may be mapped to a limited contiguous memory address space(interval)according to hash function H(key)combined with a processing conflict method.The table created is called a hash table,the mapping process is called hashing,and the resulting storage location is called the hash address [15].
3.1 Basic concepts of database query
To understand the basics of multi-hash coding,we first introduce concepts related to subgraph queries on power grid graph databases.
Definition 1 [16].A label-determined graph is represented by the five-tuplewhere V represents the set of vertices in the graph,E represents the set of determined edges of the graph,is the set of labels that determine the vertices of the graph,is the set of labels that determine the edges of the graph,and L is an assignment function that assigns vertex labels and edge labels EΣ to the corresponding vertices and edges.In graph G,the set of vertices is determined by V(G)and the set of edges is determined by E(G).Here,represents the number of vertices in graph G and represents the number of edges in graph G.
Definition 2 [17].The subgraphs are isomorphic,that is,for two deterministic graphs and if there is a single shot θ:V → V',the following conditions will be satisfied:
(1)For anyu V∈ ,there is
(2)For any(u,v) ∈ E,there is
(3)For any(u,v) ∈ E,there is
Then,θ is the isomorphism of subgraphs G and has subgraph with then subgraph is called the embedding of G in G' under subgraph isomorphism θ.
Definition 3 [18].Frequent subgraphs have graphs g and g',where g ∈ g'; if there are isomorphic subgraphs from g to 'g,and g support(number of occurrences of g as a subgraph in graph database D)is not less than the minimum support degree min_sup,which is called a frequent subgraph.
Definition 4 [19].A hypergraph query is a given graph query q.The hypergraph query finds all the graph sets Dq that contain q in database that is,.
3.2 Multi-hash coded index
In a grid graph database,frequent subgraphs are used as features to extract the picture segments with decisive effects; they are coded by the depth first search(DFS)[20]and organized in a prefix tree.Multiple hashing methods are used to map the codes in the tree,store them in the hash table structure,use the query graph q to search for feature items in the filtering process,and generate a candidate set Cq that is as small as possible.Then,q and Cq subgraph isomorphism verification match,get the hypergraph query result set.
Using a multi-hash probe sequence,the keywords can be evenly hashed at each position in the table,thereby,effectively avoiding conflicting stored procedures and enabling fast index lookup.The multi-hash method is one of the better ways to resolve conflicts in the open addressing method.Its probe sequence is as follows:
Equation(1)indicates that when a collision occurs,the key value is first mapped using the second hash function.If there still exists a conflict after mapping,the key value is mapped using the third hash function.Otherwise,the value is saved directly.According to the characteristics of multiple hashes,this study designs an algorithm that maps DFS codes to hash tables,as shown in Algorithm 1.In the algorithm,according to the input DFS encoding,a hash value is used to determine whether there is a conflict; if there is no conflict,the corresponding key value is directly stored,otherwise,the function is used for address exploration and comparison until an available address is found.Finally,the multi-hash code is returned.The algorithm is shown as follows.
Algorithm 1.Multi-hash encoding Input:DFS encoding Output:mhcode 1.key DFScode ←2.hashtable φ←3.where not EOF hashtable 4 . 1( )t hkey=5.If t does not conflict 6.then mhcode t←7.else {for(int i =1;i<=n;i++){ n =∑ }}8.If t still has conflicts 9.then t hkey m i( )%i=1 n ∑ ,where j is a natural number 10.next key 11.end 12.return mhcode t hkey j m=[ ( ) ]%i +i=1
Fig.1 DFS prefix tree
Here,some keywords {19,28,29,48,99} are taken as examples to map to a hash table with a spatial size of 10.For even distribution throughout the table,we define a hash function as follows based on the multi-hash method:
where Ri is a prime number less than the length of the hash table.For the first case,we chooseIf there is still a conflict,select and so on.If there are very few situations in which there is a conflict,the next approach is followed.
where j is a natural number.
Table 1 Hash table storage using the multi-hash method
Insert 99 0 29 29 29 Empty table Insert 19 Insert 28 Insert 29 Insert 48 1 2 3 4 48 48 5 99 6 7 28 28 28 28 9 19 19 19 19 19 8
According to we can store 19 and 28.The first conflict occurred when 29 was inserted,Therefore,29 wasinserted at position 0.The second conflict occurred when 48 was inserted,Therefore,48 was inserted at position 4.The third conflict occurred when 99 was inserted,
Therefore,99 was inserted at position 5.At this point,the three conflicting keywords were well mapped to the hash list.
3.3 Query processing
Based on the previous work,further filtering is required.In the filtering process,if q is frequent,the hash encoding can be quickly located directly from the feature set F,and the candidate set can be quickly filtered out [21-28].If q is infrequent,according to “including logic”:If subgraph f of q exists in the feature set,and f is a feature fragment in the feature set,the hypergraph of f can be found directly through the DFS prefix tree,which must satisfyTherefore,by comparing the hash code of the query fragment xi and the feature fragment fi,Algorithm 2 can directly locate the index segment and determine its candidate set.For a sub-picture segment that does not have a key value in the hash table,it is not necessary to detect a super-picture set; its structure must not belong to the candidate set.Finally,the candidate set obtained by using the feature segments of each length is used as an intersection operation to further narrow the candidate set range and reduce the candidate set size.
In Algorithm 2,len(xi)refers to the length of fragment xi,Dxi refers to the supergraph set of fragment xi, and mhcode(x)refers to the hash encoded value of x.
Algorithm 2.Obtaining candidate set Input:Feature set fi,query graph q,maximum feature fragment length maxlength,graph data set D Output:mhcode wq 1.qw D←2.for fragment ix q⊆ and () max lenx l≤ 3.if () ()i ==4.then q q xi mhcodex mhcodef i i w w D= ∩5. () ()1 lenx lenx= +6.end 7.return wq;i i
4 Experimental results and analysis
The experimental hardware environment was a server with Intel Xeon E7-4809 processor(48 core,1.90 GHz),128 GB memory,and the operating system was CentOS6.5.It was optimized to maximize the speed.
The experimental data were based on real power grid topological data from one power company,including 30 million nodes and 80 million edges.Based on graph theory,the data structure of the real power grid topology was defined.As shown in Fig.2,the power grid equipment is described as “node.” The relationship between each equipment is described as “edge.” The meanings of “node” and “edge” can be expressed by “attribute.”
In a real data set,the size Q of the query graph is set to 2 K,4 K,5 K,8 K,and 10 K.The size of the query graph and the filtering effect in the graph database were used to compare the respective filtering performances.As can be seen from Fig.3a,the multi-hash encoding method has better filtering effect than the traditional encoding method; that is,a smaller size candidate set can be obtained.
Fig.2 Real power grid topology data structure
In the index construction algorithm of graph features,the traditional method generally uses the hash table structure to store index encoding.A hash function in the stored procedure is designed by using a contiguous unit with a larger range of the following table,mapping each code so that it has a function value corresponding to it in the hash table,and using the array unit of the value to store the corresponding code.However,this method does not guarantee that the keyword of each element has a one-to-one correspondence with the function value.It is very likely that the same function value is calculated for different elements; thus,a conflict occurs,that is,two different key values pass through.The index position calculated by the hash function is the same,resulting in a long filtering phase,and a large obtained candidate set size.Multi-hash encoding avoids this conflict and efficiently stores the code.
Analog data is more controllable,and multiple sets of data can be selected for comprehensive comparison.As shown in Fig.3b,when the size of the query graph is small,the multi-hash encoding method has little difference in terms of efficiency.As the query graph size increases,the conflict rate of the multiple hashes in the query mapping process is greatly reduced,and the superiority of the method compared with the traditional method can be clearly observed.
Fig.3a Comparison of filtering effects of candidate graphs
Fig.3b Comparison of filtering effects of candidate graphs
5 Conclusion
Hypergraph querying has a strong practical application in power grid graph database.In the graph query process,the “filter-verify” mechanism is widely used.Index construction and query algorithm,in which index construction needs to be considered along with index filtering capabilities,size,index updates and other factors,has been widely studied.To improve the querying efficiency,this study proposed a multi-hash storage index coding method based on the index of frequent subgraphs to improve the efficiency of the candidate set generation in the “filtering” stage; thus,the time consumption of the entire query process was reduced.Experimental results showed the effectiveness of the proposed method.
Acknowledgements
This work was supported by the State Grid Science and Technology Project(Title:Research on High Performance Analysis Technology of Power Grid GIS Topology Based on Graph Database,5455HJ160005).
References
-
[1]
Andrea E,Viorica V,Christian S et al(2017)Conceptual graph driven modeling and querying methods for RDMBS and XML databases.In:13th Intelligent Computer Communication and Processing(ICCP),Cluj-Napoca,Romania,2017:55-62 [百度学术]
-
[2]
Pavel S,Jiri H,Pavel M et al(2018)Performance testing of NoSQL and RDBMS for storing big data in e-applications.In:3rd International Conference on Intelligent Green Building and Smart Grid(IGBSG),2018:1-4 [百度学术]
-
[3]
Peter A,Blimat A(2018)Knowledge-based fast web query engine using NoSQL.In:6th International Symposium on Digital Forensic and Security(ISDFS),2018:1-5 [百度学术]
-
[4]
Shady H,Zurinahni Z(2017)Document-Oriented Data Schema for Relational Database Migration to NoSQL.In:International Conference on Big Data Innovations and Applications(Innovate-Data),2017:43-50 [百度学术]
-
[5]
Yang C,Li Z,Qu W et al(2017)Grid-Based Indexing and Search Algorithms for Large-Scale and High-Dimensional Data.In:2017 14th International Symposium on Pervasive Systems,Algorithms and Networks & 2017 11th International Conference on Frontier of Computer Science and Technology & 2017 Third International Symposium of Creative Computing(ISPAN-FCSTISCC),2017:46-51 [百度学术]
-
[6]
Liu G,Liu K,Shi D et al(2017)Graph Computation and Its Applications in Smart Grid.In:2017 IEEE International Congress on Big Data,2017:507-510 [百度学术]
-
[7]
Ma Y,Wu Z,Lin G et al(2014)Study on the relationship between transmission line failure rate and lightning information based on Neo4j.In:2014 International Conference on Power System Technology,2014:474- 479 [百度学术]
-
[8]
Alnaymat G(2013)GCG:Mining Maximal Complete Graph Patterns from Large Spatial Data.In:2013 ACS International Conference on Computer Systems and Applications(AICCSA),2013:1-8 [百度学术]
-
[9]
A John,M Sugumaran,R Rajesh(2017)Performance analysis of the past,present and future indexing methods for spatio-temporal data.In:2017 2nd International Conference on Communication and Electronics Systems(ICCES),2017:645-649 [百度学术]
-
[10]
Andrea M,Viorica V,Christian S et al(2017)Conceptual graph driven modeling and querying methods for RDMBS and XML databases.In:2017 13th IEEE International Conference on Intelligent Computer Communication and Processing(ICCP),2017:55-62 [百度学术]
-
[11]
Zhang J,Kwong S,Liu G et al(2018)PathEmb:Random Walk based Document Embedding for Global Pathway Similarity Search.In:IEEE Journal of Biomedical and Health Informatics,2018:1-1 [百度学术]
-
[12]
Sun Z,Huo H,Chen X et al(2018)Fast Top-K Graph Similarity Search Via Representative Matrices.IEEE Access,2018:21408-21417 [百度学术]
-
[13]
Anna S,Marek T(2017)Graph-Based Speculative Query Execution in Relational Databases.In:2017 16th International Symposium on Parallel and Distributed Computing(ISPDC),2017:122-131 [百度学术]
-
[14]
Ding X,Ou Y,Jia J et al(2017)Efficient Subgraph Search on Large Anonymized Graphs.In:2017 International Conference on Green Informatics(ICGI),IEEE Computer Society,2017:223-228 [百度学术]
-
[15]
Zhu Y,Yang S,Gan G et al(2018)An Efficient Retrograde Storage for Self-Destructing Messages on Frequently Colliding Hash Table.In:IEEE 42nd Annual Computer Software and Applications Conference(COMPSAC),2018:914-919 [百度学术]
-
[16]
Dai J,Zhou Z,Yao Z et al(2017)Cyber physical power system modeling and simulation based on graph computing.In:2017 IEEE Conference on Energy Internet and Energy System Integration(EI2),2017:1-6 [百度学术]
-
[17]
Bonnici V,Giugno R(2017)On the variable ordering in subgraph isomorphism algorithms.IEEE/ACM Transactions on Computational Biology and Bioinformatics,14(1):193-203 [百度学术]
-
[18]
Petermann A,Micale G,Bergami G et al(2017)Mining and Ranking of Generalized Multi-Dimensional Frequent Subgraphs.In:12th International Conference on Digital Information Management(ICDIM),2017:236 -245 [百度学术]
-
[19]
Sen S,Ghosh M,Dutta A et al(2015)Hypergraph Based Query Optimization.In:International Conference on Computer Communication and Informatics,2015:1-8 [百度学术]
-
[20]
Joolee J,Lee Y(2018)Video Retrieval Based on Image Queries Using THOG for Augmented Reality Environments.In:2018 IEEE International Conference on Big Data and Smart Computing(BigComp),IEEE Computer Society,2018:557-560 [百度学术]
-
[21]
Hu W,Fan Y,Xing J et al(2018)Deep Constrained Siamese Hash Coding Network and Load-Balanced Locality-Sensitive Hashing for Near Duplicate Image Detection.IEEE Transactions on Image Processing,27(9):4452-4464 [百度学术]
-
[22]
Peng Y,Li H,Cui J et al(2018)Towards Secure Approximate k-Nearest Neighbor Query Over Encrypted High-Dimensional Data.IEEE Access,2018:23137-23151 [百度学术]
-
[23]
Sapkota M,Shi X,Xing F et al(2018)Deep Convolutional Hashing for Low Dimensional Binary Embedding of Histopathological Images.IEEE Journal of Biomedical and Health Informatics,pp(99):1-1 [百度学术]
-
[24]
Huang Y,Duan L,Wang Z et al(2017)A Multi-Block N-ary trie structure for exact r-neighbour search in hamming space.In:2017 IEEE International Conference on Image Processing(ICIP),2017:1117-1121 [百度学术]
-
[25]
Shen Y,Liu L,Shao L et al(2017)Deep Binaries:Encoding Semantic-Rich Cues for Efficient Textual-Visual Cross Retrieval.In:2017 IEEE International Conference on Computer Vision(ICCV),2017:4117-4126 [百度学术]
-
[26]
Zuo P,Hua Y(2018)A Write-Friendly and Cache-Optimized Hashing Scheme for Non-Volatile Memory Systems.IEEE Transactions on Parallel and Distributed Systems,2017:985-998 [百度学术]
-
[27]
Reato T,Demir B,Bruzzone L(2017)Primitive cluster sensitive hashing for scalable content-based image retrieval in remote sensing archives.In:2017 IEEE International Geoscience and Remote Sensing Symposium(IGARSS),2017:2199-2202 [百度学术]
-
[28]
Matsui Y,Yamasaki T,Aizawa K(2018)PQTable:Nonexhaustive Fast Search for Product-Quantized Codes Using Hash Tables.IEEE Transactions on Multimedia,2017:1809-1822 [百度学术]
Fund Information
supported by the State Grid Science and Technology Project (Title: Research on High Performance Analysis Technology of Power Grid GIS Topology Based on Graph Database, 5455HJ160005);
supported by the State Grid Science and Technology Project (Title: Research on High Performance Analysis Technology of Power Grid GIS Topology Based on Graph Database, 5455HJ160005);