How to Scrape LinkedIn Company Data Using the Voyager API

What Is LinkedIn Scraping?

LinkedIn scraping is the process of extracting data from LinkedIn's website using automated scripts or tools. It involves collecting information from profiles, company pages, job postings, and other publicly available data.

It is a specific case of web scraping, which involves automatically gathering data from websites by sending HTTP requests and extracting specific information from the HTML content of the responses.
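As a minimal sketch of the "extract information from HTML" half of that idea, the snippet below parses a hard-coded HTML string with Python's standard-library html.parser instead of fetching a live page. The SAMPLE_HTML markup, the TitleParser class, and the extract_title helper are illustrative assumptions and do not come from any LinkedIn page.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for the body of an HTTP response.
SAMPLE_HTML = (
    "<html><head><title>Example Company</title></head>"
    "<body><p>About us</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    @property
    def title(self):
        text = "".join(self._chunks).strip()
        return text or None


def extract_title(html):
    """Return the page title, or None if the HTML has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


print(extract_title(SAMPLE_HTML))
```

In a real scraper the HTML string would come from an HTTP response body rather than a constant.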

Note: Violating LinkedIn's terms of service can lead to account suspension, IP blocking, or legal action.

Review a website's terms of service and API usage guidelines before scraping it to avoid legal or ethical issues.

What Is the Voyager API?

Voyager is the internal API that LinkedIn's own web and mobile front ends call; its endpoints live under linkedin.com/voyager/api. It is not a documented public API, so its endpoints and response formats can change without notice. It is unrelated to the geospatial search platform of the same name, which has its own REST API.

LinkedIn developed Voyager-API, a new API service, to provide a more resilient platform for its web and mobile applications, based on the Play framework and the GraphQL query language.

How to Obtain JSESSIONID and CSRF Token:

JSESSIONID: When a user first accesses a website, the web server creates a session and assigns it a unique identifier. Java-based servers typically store it in a cookie named JSESSIONID.

  • The browser keeps this cookie and sends it along with later requests so the server can identify the user's session.

  • You can find the JSESSIONID by viewing the page's cookies in your browser's developer tools while visiting the site.

  • Look for a cookie named "JSESSIONID" or something similar. Keep in mind that some websites use different cookie names for session tracking.
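If you copy the whole Cookie request header out of the developer tools instead of a single value, a short helper can pull the session cookie out of it. This is only a sketch using the standard-library http.cookies module; the get_cookie_value function and the raw_header string are illustrative assumptions, with placeholder values in the same shape as the ones used later in this post.

```python
from http.cookies import SimpleCookie


def get_cookie_value(cookie_header, name):
    """Return the named cookie's value from a raw Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get(name)
    if morsel is None:
        return None
    # Some cookie values are wrapped in double quotes; drop them if present.
    return morsel.value.strip('"')


# Placeholder header copied by hand from the browser (values are not real).
raw_header = 'li_at=AQEDXXXX; JSESSIONID="ajax:1XXXX0"'

print(get_cookie_value(raw_header, "JSESSIONID"))
print(get_cookie_value(raw_header, "li_at"))
```

Treat these values like passwords: anyone who has them can act as your logged-in session.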

CSRF Token: The CSRF (Cross-Site Request Forgery) token is a security measure that prevents malicious websites from sending unauthorized requests on a user's behalf.

  • It is frequently required when sending POST, PUT, or DELETE requests to web services.
  • The method for obtaining the CSRF token differs by website or online service. In some cases, the token is hidden in a form on the web page and can be extracted using web scraping techniques.
  • In other cases, the CSRF token is sent in the response headers of an initial GET call to the web service. It can then be read from those headers and used in subsequent requests as needed.
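The two generic approaches just described can be sketched as follows. The field name csrf_token, the header name X-CSRF-Token, and the find_csrf_token helper are hypothetical; every site picks its own names, so check the actual page or response before relying on them. (In the LinkedIn script later in this post, the csrf-token header is simply the JSESSIONID cookie value with its quotes removed.)

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        is_hidden = (attr_map.get("type") or "").lower() == "hidden"
        if is_hidden and attr_map.get("name"):
            self.hidden_fields[attr_map["name"]] = attr_map.get("value") or ""


def find_csrf_token(html="", headers=None,
                    field_name="csrf_token", header_name="X-CSRF-Token"):
    """Look for a token in response headers first, then in hidden form fields."""
    for key, value in (headers or {}).items():
        # Header names are case-insensitive.
        if key.lower() == header_name.lower():
            return value
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields.get(field_name)


# Hypothetical page markup and headers used only for illustration.
page = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
print(find_csrf_token(html=page))
print(find_csrf_token(headers={"x-csrf-token": "hdr456"}))
```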

Step-by-step walkthrough of the script below, which uses Python's requests module to perform HTTP requests:

  1. The script imports the requests library, a well-known Python module for performing HTTP requests.
  2. It stores a user-agent string in the headers dictionary. The user agent identifies the type of client making the request to the server.
  3. It defines the variable company_link, which holds the company information API endpoint URL. The URL has the form linkedin.com/voyager/api/entities/companies/{company_id}, where company_id is replaced with the company's entity ID.
  4. It creates a requests session object, which persists cookies and headers across requests.
  5. It sets the cookies the session needs (li_at and JSESSIONID) through the s.cookies property. These cookies identify your logged-in LinkedIn session.
  6. It sets the session's user-agent and csrf-token headers so the request resembles one from a standard web browser. The csrf-token value is the JSESSIONID cookie value with any surrounding quotes removed.
  7. It calls s.get(company_link) to send an HTTP GET request to the company_link URL.
  8. The API responds with JSON.
  9. The script parses the JSON response into a Python dictionary and returns it. Printing the dictionary shows company information such as employee count range, website URL, company type, industries, description, and more.

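import requests


def getdatafromvoyagerlinkedin(company_id):
    # Browser-like user agent so the request resembles normal web traffic.
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    }

    company_link = f'https://www.linkedin.com/voyager/api/entities/companies/{company_id}'

    with requests.Session() as s:
        # Replace both placeholders with the cookie values from your own logged-in browser session.
        s.cookies['li_at'] = "AQEDATf5D_XXXXXXXXXXXXXXu"
        s.cookies["JSESSIONID"] = "ajax:1XXXXXXXXXXXXXX0"
        # update() keeps the session's default headers instead of replacing them.
        s.headers.update(headers)
        # The csrf-token header carries the JSESSIONID value without surrounding quotes.
        s.headers["csrf-token"] = s.cookies["JSESSIONID"].strip('"')
        response = s.get(company_link)
        # Fail with a clear HTTP error instead of a JSON decoding error.
        response.raise_for_status()
        return response.json()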


print(getdatafromvoyagerlinkedin(16198870))
# Output
#{'employeeCountRange': '11-50',
 #'specialties': ['Web Development', 
#'Mobile App Development', 'Web Design', 
#'Node', 'Flutter', 'Ionic', 'AWS', 'Digital Ocean', 'Laravel', 'WordPress', 
#'React Native', 'React JS experts', 'PHP development', 'PrestaShop', 'OpenCart',
# 'SEO and SEM', 'Joomla'],
 #'entityUrn': 'urn:li:fs_company:16198870', 
#'websiteUrl': 'https://www.bytescrum.com', 
#'companyType': 'Privately Held', 'foundedDate': {'year': 2017}, 
#'entityInfo': {'objectUrn': 'urn:li:company:16198870',
#'trackingId': 'ymxYkL2eSUSOWQIt+Mn3xQ=='}, 
#'industries': ['Information Technology and Services'],
 #'description': 'ByteScrum taps into its strong business acumen to find solutions to the unique set of challenges and constraints imposed by each new project and delivers solutions that fill performance gaps. 
#Our founders understood for the first time how good software development services can transform the needs of entire business communities, especially emerging technologies.
# We have a proven track record of successfully meeting deadlines and executing the most complex projects within budget while consistently maintaining the highest quality.\n\nOur specialities:\n\n● Mobile app development
 #(Android and iOS)\n● Web app development (MERN, MEAN, Vue JS, PHP, Laravel, WordPress)\n● Custom Software development\n● Web designing (PSD to HTML/WordPress)\n● Api integration \n● 
#CMS development\n● Web and app service integration\n● SEO and SEM services\n\nContact Us:\nhttps://www.bytescrum.com/contact-us/', 
#'basicCompanyInfo': {'headquarters': 'Lucknow',
 #'followingInfo': {'entityUrn': 'urn:li:fs_followingInfo:urn:li:company:16198870', 
#'dashFollowingStateUrn': 'urn:li:fsd_followingState:urn:li:fsd_company:16198870', 'following': False, 'trackingUrn': 'urn:li:company:16198870', 'followingType': 'DEFAULT'},
# 'miniCompany': {'objectUrn': 'urn:li:company:16198870', 'entityUrn': 'urn:li:fs_miniCompany:16198870', 'name': 'ByteScrum Technologies Private Limited',
# 'showcase': False, 
#'active': True,
# 'logo': {'com.linkedin.common.VectorImage': {'artifacts': [{'width': 200, 'fileIdentifyingUrlPathSegment': '200_200/0/1653201669588?e=1698883200&v=beta&t=GE_5HHCt3u_xxKWDV1d3KmNBx0-AJXvIyjkIxSaXp-E',
# 'expiresAt': 1698883200000, 'height': 200}, {'width': 100, 'fileIdentifyingUrlPathSegment': '100_100/0/1653201669588?e=1698883200&v=beta&t=rbIH_vzfS4YkrOV-inNhuY9XXdbj28K9l4ZY_4-I41o', 
#'expiresAt': 1698883200000, 'height': 100}, {'width': 400, 'fileIdentifyingUrlPathSegment':
# '400_400/0/1653201669588?e=1698883200&v=beta&t=rARzTyswXT1D9vObNkCAh9ljFivi4r6T0QxC_WwLVvQ', 'expiresAt': 1698883200000, 'height': 400}], 
#'rootUrl': 'https://media.licdn.com/dms/image/C4D0BAQHzTgUzh6WpUw/company-logo_'}}, 'universalName': 'bytescrum', 'dashCompanyUrn': 'urn:li:fsd_company:16198870', 
#'trackingId': 'XXXXXXXXXXXXXXXXX'}}}
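The full response is deeply nested, so in practice you will usually want only a handful of fields. The sketch below picks a few of the keys visible in the output above and tolerates missing ones, since the response format is not guaranteed. The summarize_company helper and the trimmed sample dictionary are illustrative assumptions, not part of the original script.

```python
def summarize_company(response_dict):
    """Pick a few fields out of the response, tolerating missing keys."""
    basic_info = response_dict.get("basicCompanyInfo") or {}
    mini_company = basic_info.get("miniCompany") or {}
    founded = response_dict.get("foundedDate") or {}
    return {
        "name": mini_company.get("name"),
        "headquarters": basic_info.get("headquarters"),
        "website": response_dict.get("websiteUrl"),
        "employee_range": response_dict.get("employeeCountRange"),
        "company_type": response_dict.get("companyType"),
        "industries": response_dict.get("industries") or [],
        "founded_year": founded.get("year"),
    }


# Trimmed copy of the output above, used here instead of a live request.
sample = {
    "employeeCountRange": "11-50",
    "websiteUrl": "https://www.bytescrum.com",
    "companyType": "Privately Held",
    "foundedDate": {"year": 2017},
    "industries": ["Information Technology and Services"],
    "basicCompanyInfo": {
        "headquarters": "Lucknow",
        "miniCompany": {"name": "ByteScrum Technologies Private Limited"},
    },
}

summary = summarize_company(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Using .get() with fallbacks means a missing or renamed key yields None or an empty list rather than a KeyError.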

LinkedIn assigns a unique numeric ID to each company page. Once you have the ID of the company you want, pass it to the function in place of the number used above; it becomes the last part of the endpoint URL:

linkedin.com/voyager/api/entities/companies/{company_id}
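The company ID also appears at the end of URNs such as the entityUrn value in the output above ('urn:li:fs_company:16198870'). As a small sketch, the helpers below extract the ID from such a URN and build the endpoint URL from it. The names VOYAGER_COMPANY_ENDPOINT, company_id_from_urn, and build_company_url are hypothetical and exist only for this example.

```python
VOYAGER_COMPANY_ENDPOINT = (
    "https://www.linkedin.com/voyager/api/entities/companies/{company_id}"
)


def company_id_from_urn(urn):
    """Return the numeric ID at the end of a URN like 'urn:li:fs_company:16198870'."""
    last_part = urn.rsplit(":", 1)[-1]
    if not last_part.isdecimal():
        raise ValueError(f"no numeric company ID in {urn!r}")
    return int(last_part)


def build_company_url(company_id):
    """Fill the company ID into the endpoint URL pattern."""
    return VOYAGER_COMPANY_ENDPOINT.format(company_id=int(company_id))


company_id = company_id_from_urn("urn:li:fs_company:16198870")
print(company_id)
print(build_company_url(company_id))
```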

Conclusion

LinkedIn scraping is web scraping applied to LinkedIn: automated scripts send HTTP requests and pull specific, publicly accessible information out of the responses.

Following LinkedIn's terms of service is critical to avoid penalties such as account suspension or legal action. The Voyager API used here is the service behind LinkedIn's own web and mobile applications, and the script calls its company endpoint directly.

The JSESSIONID is a unique session identifier issued by the web server, whereas the CSRF token protects against unauthorized requests. Depending on the site, they can be read from browser cookies, from the page's HTML, or from response headers.

Knowing the rules and ethical issues around data extraction is critical when working with web scraping and APIs like Voyager.


Culled from ByteScrum
