Hadoop MapReduce Distributed Crawling of Data on All Chinese Universities
Most widely used MapReduce distributed crawlers still recommend Jsoup, but Jsoup cannot parse data loaded by JavaScript. This repository therefore uses FastJSON to crawl data on all Chinese universities, driven by a MapReduce distributed crawler from the Hadoop ecosystem. My development environment is Windows 10, and a virtual Hadoop cluster on Linux or Mac cannot be tested from that environment, so the HDFS paths currently assume a Linux layout. If you are interested, please submit Issues or PRs.
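Since the target site serves its data through JSON APIs rather than static HTML, the crawler parses raw JSON instead of scraping markup. Below is a minimal sketch of that approach, assuming a hypothetical endpoint and field names (the real API and fields used by the project may differ):

```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSON endpoint; the project's actual API URL differs.
        URL url = new URL("https://example.com/api/schools?page=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");

        // Read the raw response body.
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) body.append(line);
        }

        // FastJSON parses the JSON string directly; no HTML parsing is involved.
        JSONObject root = JSON.parseObject(body.toString());
        JSONArray schools = root.getJSONArray("data"); // "data" is an assumed field name
        for (int i = 0; i < schools.size(); i++) {
            JSONObject school = schools.getJSONObject(i);
            System.out.println(school.getString("name")); // "name" is also assumed
        }
    }
}
```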
This repository contains:
- Building a simulated distributed environment under Windows
- Crawling 掌上高考 (the Zhangshang Gaokao college-admissions portal)
- Data Storage
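As a rough illustration of how these pieces fit together, here is a minimal MapReduce crawl job sketch: the mapper fetches each seed URL from its input split and the reducer deduplicates by URL before the results land on HDFS. Class names, paths, and the fetch helper are illustrative assumptions, not the project's actual code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlJob {

    // Fetch one URL and return the response body as a string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        return sb.toString();
    }

    // Each input line is assumed to hold one seed URL; the mapper fetches it
    // and emits (url, responseBody).
    public static class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) return;
            context.write(new Text(url), new Text(fetch(url)));
        }
    }

    // Duplicate seed URLs collapse here: keep the first response per URL.
    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
                break;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "school-crawl");
        job.setJarByClass(CrawlJob.class);
        job.setMapperClass(CrawlMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // seed list on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```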
This project uses Java and Git. Go check them out if you don't have them locally installed.
```sh
git clone https://github.com/weiensong/ScrapySchoolAll.git
```
- In a truly distributed environment:

```sh
mvn package
# run on the master node, with input/output paths on HDFS
hadoop jar PackageName.jar
```
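Once the job finishes, its results can be inspected on HDFS. The output path below is an assumption; adjust it to whatever path the job was given:

```sh
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000 | head
```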
- In a distributed environment simulated on Windows:
- Run initTest.bat directly as administrator, or run the equivalent commands manually:

```bat
:: run from the script's own directory
cd /d "%~dp0"
:: install the Hadoop native library for Windows
copy hadoop.dll C:\Windows\System32
:: compile and run the job locally
cd src\main\java\job
javac MyJob.java
java MyJob
```
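Copying hadoop.dll into System32 supplies the native library Hadoop needs on Windows. A common complementary workaround (not specific to this project) is to point hadoop.home.dir at a folder containing bin\winutils.exe before submitting the job; the path below is an assumed example:

```java
public class WindowsSetup {
    public static void main(String[] args) {
        // Assumed install path; the folder must contain bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        // ...then configure and submit the MapReduce job as usual.
    }
}
```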
- hadoop — Apache Hadoop
- opsariichthys-bidens — A basic-information API for all Chinese national universities (中国全国大学基本信息API搭建)
Feel free to dive in! Open an issue or submit PRs.
The Java code follows the Google Java Style Guide, and the project adheres to the Apache Code of Conduct.
This project exists thanks to all the people who contribute.
Apache License 2.0 © weiensong