Sales Lead Generation Using Spark and Twitter Stream

Vivien Holecz and Lili Aunimo, Haaga-Helia University of Applied Sciences
Published 28.2.2018


Automatic creation of sales leads is an important business application of big data. It belongs to the broader field of customer acquisition. Data-driven methods that aid customer acquisition are an under-researched field (D'Haen and Van den Poel, 2013). This study, carried out in the research project Big Data – Big Business, is an attempt both to clarify the requirements of an automatic sales lead generation system and to gain initial experience of building such a system.

Existing sales lead generation systems typically mine information from large semi-structured or unstructured datasets, such as social media sources and digital media including newspapers, official information forums and company web sites. Examples of social media sources include the Twitter feed, Facebook posts and various professional discussion forums. The result of automatic sales lead generation is a list of potential customer companies, with or without a contact person at each company, together with a text snippet that explains why the company was selected. In the optimal case, this text snippet contains information about the specific products that the company might be willing to purchase.

In the context of the Big Data – Big Business project, we built a prototype for automatically finding new potential sales leads by listening to the Twitter feed. In addition to gaining experience of data mining based on the Twitter feed, we also wanted to gather experience of using Spark. Spark is an open source cluster computing framework for processing big data. It can be programmed through several APIs (Application Programming Interfaces). Spark Core is extended by libraries such as Spark Streaming for processing stream data (such as the Twitter stream) and Spark MLlib for applying machine learning algorithms. Many commercial big data platforms offer Spark. Examples of these are IBM's Bluemix and Microsoft Azure.

We built our own tool that listens to the Twitter feed, filters relevant tweets based on keywords and then visualizes the results in real time on a web page. We also benchmarked commercial media intelligence tools from Meltwater and Mbrain in order to see how well they perform in automatic generation of sales leads. The advantage of building the system from scratch ourselves, using only open source software, is that we have full control over the system and know exactly why it produces the results it does. The implementation was produced as thesis work at Haaga-Helia University of Applied Sciences.

Technical details of the implementation

Caption: Figure 1. Components of the implementation.

As Figure 1 suggests, two cloud services were used for the implementation. Processing started in Haaga-Helia’s own cloud environment, provided by CSC and running the Ubuntu 16.04 operating system. Apache Spark was installed in this environment. Spark can be programmed in four languages: Scala, Java, Python and R. Python was chosen because of the developer's prior experience with it. In order to communicate with the Twitter API, a small Python script was written using the Twython library. The API streams the tweets as JSON objects, and there is detailed documentation on the attributes and their meaning. The Twython script handles the authentication with the Twitter API, streams tweets based on filtering criteria such as language and location, strips the JSON objects of unnecessary attributes and forwards them to Spark through a socket connection. The Spark script is responsible for scanning each tweet for keywords using regular expressions and saving the matching tweets into a MySQL database.
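The stripping and forwarding step can be illustrated with a minimal sketch. It is not the code of the thesis implementation: the function names are hypothetical, and the choice of retained attributes (`text`, `user.screen_name`, `place.name`, `created_at`) is an assumption based on the fields that the article says end up in the database. Each tweet is written as one newline-terminated UTF-8 JSON line, because a Spark Streaming socket text source reads newline-delimited text.

```python
import json
import socket


def strip_tweet(tweet):
    """Keep only the attributes needed downstream (assumed selection)."""
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}  # "place" can be null in the Twitter JSON
    return {
        "text": tweet.get("text", ""),
        "user_name": user.get("screen_name"),
        "city": place.get("name"),
        "created_at": tweet.get("created_at"),
    }


def to_wire(tweet):
    """Serialize one stripped tweet as a single newline-terminated JSON line.

    json.dumps escapes line breaks inside the tweet text, so every tweet
    occupies exactly one line on the socket.
    """
    line = json.dumps(strip_tweet(tweet), ensure_ascii=False)
    return (line + "\n").encode("utf-8")


def forward(conn, tweet):
    """Send one tweet over an already connected socket (hypothetical helper)."""
    conn.sendall(to_wire(tweet))
```

In a Twython-based script, a function like `forward` would typically be called from the streamer's `on_success` callback, once per received tweet.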

The MySQL database is located in a Digital Ocean droplet, because the thesis worker intended to retain access to it even after the project was completed. Besides the database, the website is also hosted on this server. The structure of the MySQL database is very simple. There is a separate table for each search keyword, three in total. When Spark finds a matching tweet, it saves it into the corresponding table. The table contains not only the tweet text, but also the user name of the creator, the city it was posted in, and a time stamp. The Flask framework was used for the web application in order to be able to continue using Python. The Flask application consists of Python scripts and HTML templates. The HTML templates contain the basic structure of the pages and the JavaScript program for the visualizations, which uses the D3.js library. The Python script reads the data from MySQL and passes it into the templates to create the actual HTML pages.
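The table layout described above can be sketched as follows. To keep the example self-contained, it uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table names, the column names and the third keyword are assumptions, since the article does not list them. A MySQL driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3

# "construction" and "building" are mentioned in the article as example
# keywords; "renovation" is a hypothetical third keyword.
KEYWORDS = ("construction", "building", "renovation")
COLUMNS = ("text", "user_name", "city", "created_at")


def table_name(keyword):
    """Map a keyword to its table, rejecting anything not in the fixed list."""
    if keyword not in KEYWORDS:
        raise ValueError("unknown keyword: {!r}".format(keyword))
    return "tweets_" + keyword


def create_tables(conn):
    """Create one table per search keyword."""
    for keyword in KEYWORDS:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS {} ("
            "id INTEGER PRIMARY KEY, "
            "text TEXT NOT NULL, user_name TEXT, city TEXT, created_at TEXT)"
            .format(table_name(keyword))
        )


def save_match(conn, keyword, tweet):
    """Store a matching tweet in the table of the keyword it matched."""
    conn.execute(
        "INSERT INTO {} (text, user_name, city, created_at) "
        "VALUES (?, ?, ?, ?)".format(table_name(keyword)),
        tuple(tweet.get(column) for column in COLUMNS),
    )
    conn.commit()


def load_rows(conn, keyword):
    """Read all rows for a keyword as dictionaries, ready for a template."""
    cursor = conn.execute(
        "SELECT text, user_name, city, created_at FROM {} ORDER BY id"
        .format(table_name(keyword))
    )
    return [dict(zip(COLUMNS, row)) for row in cursor]
```

A web application could pass the list returned by `load_rows` to a template, where the visualization script reads it as JSON. Table names cannot be passed as query parameters, which is why the sketch validates the keyword against a fixed list before building the statement.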

Experiences from the Implementation Process

Several challenges were met during the implementation. Because Spark itself is written mainly in Scala, some functionality is missing from the Python API. The first obstacle in the development was that the Python API doesn’t have built-in Twitter support, as opposed to the Scala API. On the upside, there are several Python libraries that can stream data from the Twitter API. Two libraries were tested, Tweepy and Twython. Tweepy had some bugs which resulted in unexpected shutdowns, so Twython was used in the end.

String encoding issues in Python 2.7 also caused hours of troubleshooting, especially because of the German characters ä and ü, which the ASCII codec cannot handle. The solution was to decode byte strings to Unicode strings as soon as they are received from the source and to encode them again only when forwarding them to another program.
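The decode-early, encode-late rule can be shown in a few lines. This is a sketch written for Python 3, where byte strings and text strings are separate types. The helper names are hypothetical, and UTF-8 is assumed as the encoding on both boundaries.

```python
def to_text(raw_bytes, encoding="utf-8"):
    """Decode bytes into a Unicode string as soon as they are received."""
    return raw_bytes.decode(encoding)


def to_bytes(text, encoding="utf-8"):
    """Encode a Unicode string only at the point where it leaves the program."""
    return text.encode(encoding)


def ascii_can_decode(raw_bytes):
    """Report whether the ASCII codec is able to decode the given bytes."""
    try:
        raw_bytes.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# Bytes as they might arrive from the stream (example text is made up).
incoming = "Gebäude für Büros".encode("utf-8")

text = to_text(incoming)   # all internal processing works on Unicode text
outgoing = to_bytes(text)  # bytes again, only for the next program
```

Here `ascii_can_decode(incoming)` is false because ä and ü are encoded as multi-byte sequences, which is the kind of failure that appears in Python 2.7 whenever byte strings and Unicode strings are mixed implicitly.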

Due to the installation environment of Spark, the streaming shut down regularly after running for a few hours. We suspect that insufficient RAM caused the Twitter buffer to overflow and the Twitter API to disconnect. If that is the cause, the remedy is to install Spark on a distributed system.

The final and most significant issue is that no relevant tweets were found, because the simple keyword search returned mostly irrelevant results. The typical difficulties of natural language processing were all encountered: metaphors, homonyms and contextual differences skewed the findings. When trying to find tweets about the construction industry with keywords like “construction” and “building”, the findings included: “Even at the age of 6 girls already know less about careers in construction than boys.” and “Omg how pretty is this building #london is full of beautiful treasures”. The lesson learned is that natural language processing is indeed a complex problem and a simplistic approach such as keyword filtering is insufficient. For this reason, the next steps in the project could be refining the filtering in Spark with more sophisticated natural language processing techniques, possibly involving machine learning as well.
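The weakness of keyword filtering is easy to reproduce. The following sketch applies a regular expression filter to the two tweets quoted above. The exact patterns used in the prototype are not given in the article, so whole-word, case-insensitive matching is an assumption.

```python
import re

# Keywords quoted in the article; the matching rules are assumptions.
KEYWORDS = ("construction", "building")
PATTERNS = {
    keyword: re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for keyword in KEYWORDS
}


def matched_keywords(text):
    """Return the keywords whose pattern occurs in the tweet text."""
    return [kw for kw in KEYWORDS if PATTERNS[kw].search(text)]


SAMPLES = [
    "Even at the age of 6 girls already know less about careers in "
    "construction than boys.",
    "Omg how pretty is this building #london is full of beautiful treasures",
]

if __name__ == "__main__":
    for tweet in SAMPLES:
        print(matched_keywords(tweet), "<-", tweet)
```

Both tweets pass the filter, the first on "construction" and the second on "building", although neither indicates a sales lead. The pattern sees only the surface form of the word and nothing of its context.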


Sales lead generation is a complex task that requires natural language processing techniques. By building our own prototype, we gained valuable experience of what it takes to build such a system from scratch using open source software. We also gained insight into the requirements of sales and marketing departments with regard to such a tool. We managed to build a tool that listens to the Twitter stream, performs a keyword-based search on it and visualizes the results on a web page in real time. The next steps in the project are twofold. Firstly, the system has to be made more robust: the technical reasons behind the Spark Streaming shutdown have to be investigated and fixed. Secondly, the searches have to be configured more carefully in order to meet the requirements of the sales lead generation challenge. This can be done by taking full advantage of the Twitter API and by introducing natural language processing tools into the system.


Apache Software Foundation. 2017. Apache Spark – Lightning-fast cluster computing. URL: [last accessed: 22.02.2018]

Mike Bostock. 2017. D3 – Data Driven Documents. URL: [last accessed: 22.02.2018]

Jeroen D’Haen and Dirk Van den Poel. 2013. Model-supported business-to-business prospect prediction based on an iterative customer acquisition framework. Industrial Marketing Management 42, 544–551.

Vivien Holecz. 2017. Big Data solutions for market intelligence for a B2B company. Bachelor's thesis, Haaga-Helia University of Applied Sciences.

Mbrain. Media Intelligence Solutions. Company's Website. [last accessed 20.2.2018]

Meltwater. Media Intelligence Solutions. Company's Website. [last accessed 20.2.2018]

Armin Ronacher. 2018. Flask. Web development, one drop at a time. URL: [last accessed: 22.02.2018]

Twitter Inc. 2018. Twitter Developers. API Reference Index. URL: [last accessed: 22.02.2018]

Ryan McGrath. 2013. Twython 3.6.0 documentation. URL: [last accessed: 22.02.2018]


The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.