INFLUENCE ANALYSIS ON TWITTER DATASET AND GRAPH VISUALIZATION USING GEPHI
Welcome to our blog on Influence Analysis on Twitter and Graph Visualization using Gephi. Social media platforms like Twitter have become an indispensable part of our daily lives. With over 330 million active users, Twitter is an excellent platform for gathering information and exchanging opinions on various topics. In this blog, we will explore the concept of influence analysis on Twitter and how it can be visualized using the Gephi tool. We used a massive dataset of over 11 million nodes and 85 million edges to conduct our analysis. The dataset contains information on the friendship and followership network among bloggers on Twitter. You can access the dataset here:
https://www.kaggle.com/datasets/mathurinache/twitter-edge-nodes
We follow seven steps in this process; they are described below.
Importing required packages and loading the dataset
Before we dive into the analysis, we need to import the required packages in Python.
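A minimal sketch of the imports used in the rest of this walkthrough; note that the Louvain implementation is installed as python-louvain but imported as community:

```python
from collections import Counter

import pandas as pd
import networkx as nx
import community as community_louvain  # pip install python-louvain
```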
Next, we load the Twitter dataset into pandas DataFrames and convert the data into a dictionary. The nodes data is loaded from the file /content/drive/MyDrive/sna-dataset/nodes.csv, while the edges data is loaded from /content/drive/MyDrive/sna-dataset/edges.csv. The resulting DataFrames contain information about the users and their relationships on Twitter. This data is used to perform influence analysis and to create a graph using the networkx package.
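A sketch of the loading step, assuming the file paths above and default CSV parsing; the dictionary conversion is shown via DataFrame.to_dict() as one common approach:

```python
# Load the nodes and edges data from Google Drive into pandas DataFrames.
nodes_df = pd.read_csv('/content/drive/MyDrive/sna-dataset/nodes.csv')
edges_df = pd.read_csv('/content/drive/MyDrive/sna-dataset/edges.csv')

# One common way to convert a DataFrame into a dictionary (illustrative).
nodes_dict = nodes_df.to_dict()

print(nodes_df.head())
print(edges_df.head())
```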
After this, we check for missing values and duplicates in the edges DataFrame using the isna() and duplicated() functions.
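A quick sanity check along those lines might look like this:

```python
# Count missing values per column and fully duplicated rows in the edges DataFrame.
print(edges_df.isna().sum())
print(edges_df.duplicated().sum())
```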
Creating a networkx graph object
After confirming that the edges data is clean, we can begin creating a Graph object from the edges data using the from_pandas_edgelist() function in networkx. We read the edges data from the file /content/drive/MyDrive/sna-dataset/edges.csv in chunks of 1,000,000 rows using the pd.read_csv() function. The data is then used to build a networkx Graph object with the nx.from_pandas_edgelist() function, where the source and target columns of the DataFrame supply the source and target nodes of each edge. Building the graph chunk by chunk is efficient for large datasets, where loading the entire dataset at once may be impractical due to memory constraints.
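A sketch of the chunked construction, assuming the edge list has source and target columns as described above:

```python
# Build the graph incrementally, one 1,000,000-row chunk at a time,
# so the full 85-million-edge file never sits in memory as a single DataFrame.
G = nx.Graph()
for chunk in pd.read_csv('/content/drive/MyDrive/sna-dataset/edges.csv',
                         chunksize=1_000_000):
    chunk_graph = nx.from_pandas_edgelist(chunk, source='source', target='target')
    G.add_edges_from(chunk_graph.edges())

print(G.number_of_nodes(), G.number_of_edges())
```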
Finding the number of communities or clusters in the graph
After constructing the graph, we may utilise community detection algorithms to find clusters of strongly linked nodes that serve as the network’s communities. The community_louvain package’s implementation of the Louvain technique for community detection was employed in this instance. Each node is assigned to a community via the best_partition function, which produces a dictionary with node ids as the keys and community ids as the values. We can determine how many communities were found in the network by printing the number of distinct community ids. The implementation is sketched below.
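A minimal sketch of that step:

```python
# Run Louvain community detection on the graph.
# best_partition returns a dict mapping node id -> community id.
partition = community_louvain.best_partition(G)

# The number of distinct community ids is the number of communities found.
print('Number of communities:', len(set(partition.values())))
```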
Finding the number of nodes in each cluster and edge density in clusters
Now that the Twitter network has been divided into communities using the Louvain algorithm, we counted the number of nodes in each community using the Counter class from the collections module. Using a for loop that iterates through each community and its corresponding size, the result is printed out. This information helps us better understand the organisation and dispersion of Twitter communities. We also sorted the communities by size in decreasing order and printed out the top 10 communities with the highest number of nodes.
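A sketch of the counting step, reusing the partition dictionary from above; Counter.most_common handles the sort-and-take-top-10 in one call:

```python
# Count how many nodes fall into each community.
community_sizes = Counter(partition.values())

# Print the 10 largest communities in decreasing order of size.
for community_id, size in community_sizes.most_common(10):
    print(f'Community {community_id}: {size} nodes')
```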
For edge density, we calculated the density within each community using the nx.density() function. The code loops through each community to compute the density and stores the results in a list of tuples. The list is then sorted in descending order based on density, and the top 10 communities are printed out. By doing so, we can identify the most tightly connected communities in the network, which could provide insights into how information spreads within the network.
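One way to sketch this, grouping nodes by community id and measuring the density of each induced subgraph:

```python
# Group node ids by their community id.
communities = {}
for node, community_id in partition.items():
    communities.setdefault(community_id, []).append(node)

# Compute the edge density of each community's induced subgraph.
densities = [(community_id, nx.density(G.subgraph(nodes)))
             for community_id, nodes in communities.items()]

# Print the 10 most tightly connected communities.
for community_id, density in sorted(densities, key=lambda t: t[1], reverse=True)[:10]:
    print(f'Community {community_id}: density {density:.6f}')
```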
Computing degree centrality and identifying the top 10 influential nodes in each cluster
The next step is to compute the degree centrality of each node in each community. We achieve this by creating a subgraph for each community, using a for loop that iterates through each community and its corresponding nodes. The degree_centrality() function from the NetworkX library is then used to calculate the degree centrality of each node in the subgraph. The output is a dictionary giving each node’s degree centrality within its community. This gives us an understanding of which nodes are most interconnected within each community and can be used to locate important participants and network influencers. After computing the degree centrality for each node in each community, we sorted the nodes by degree centrality in descending order and printed out the top 10 nodes for each community.
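A sketch that reuses the communities grouping from the previous step:

```python
# For each community, rank nodes by degree centrality within its subgraph.
for community_id, nodes in communities.items():
    centrality = nx.degree_centrality(G.subgraph(nodes))
    top_10 = sorted(centrality.items(), key=lambda t: t[1], reverse=True)[:10]
    print(f'Community {community_id} top 10 nodes by degree centrality:')
    for node, score in top_10:
        print(f'  {node}: {score:.6f}')
```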
Finding the influential nodes in the whole network using betweenness centrality
The betweenness centrality of nodes is calculated using the NetworkX library’s betweenness_centrality() function. The function takes the graph object as input along with other optional parameters such as the number of nodes to use for sampling and the random seed. After computing the betweenness centrality scores for each node in the graph, a for loop is used to print out the top 20 nodes with the highest betweenness centrality. This information helps identify the most important nodes in the network that are crucial in maintaining the connections between different communities and can also serve as potential key players and influencers.
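A sketch with illustrative values for the sampling size k and the random seed (exact betweenness is infeasible on a graph this large, so NetworkX’s k-node sampling approximation is the practical choice):

```python
# Approximate betweenness centrality by sampling k pivot nodes.
# k=1000 and seed=42 are illustrative values, not the blog's exact settings.
betweenness = nx.betweenness_centrality(G, k=1000, seed=42)

# Print the top 20 nodes with the highest betweenness centrality.
top_20 = sorted(betweenness.items(), key=lambda t: t[1], reverse=True)[:20]
for node, score in top_20:
    print(f'{node}: {score:.6f}')
```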
Visualizing the graph using Gephi
Now that we are done with the analysis of the graph and have found the top 20 influential nodes in the whole network, it’s time to export the graph to Gephi, a popular visualization software for network analysis. To export the graph, we use the write_graphml() function from the NetworkX library. This function exports the graph in GraphML format, a popular format for representing graphs. By exporting the graph, we can then import it into Gephi and create visualizations to gain a better understanding of the network’s structure and interactions.
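The export is a one-liner; the output filename here is illustrative:

```python
# Write the graph to GraphML so it can be opened directly in Gephi.
nx.write_graphml(G, '/content/drive/MyDrive/sna-dataset/twitter_graph.graphml')
```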
Once the graph is exported to Gephi, we can use the software’s advanced visualization tools to analyse and interpret the data. We can use different layouts to visualize the network, such as force-directed, circular, and hierarchical layouts; I used the ForceAtlas 2 layout. We can also use color-coding and sizing of nodes to represent different properties of the nodes, such as betweenness centrality, degree centrality, and community membership. You can refer to the pictures of the graph visualisation I made below.
In conclusion, we have gained insights into the network’s structure and identified important individuals and influential communities by applying a variety of network analysis approaches to the Twitter data. By visualising the network in Gephi, we were able to perform additional structural analysis and identify trends that would have been challenging to spot using only the numerical analysis in Colab.