Connecting to HDFS with the Java and Python APIs

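Last time I covered basic HDFS operations. Today we move on to working with HDFS through the Java and Python APIs; a Scala version may follow in a later post.

Before getting to the Java API, a word on the IDE: I use IntelliJ IDEA, specifically the 2020.3 x64 Community Edition.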

Java API

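Start by creating a Maven project. For the Maven configuration in IDEA, point the download source at the Aliyun mirror; this is not a requirement of Maven itself, but it makes dependency downloads far more reliable from mainland China.

The mirror is set in the corresponding settings file, D:\apache-maven-3.8.1-bin\apache-maven-3.8.1\conf\settings.xml.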

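Next, create the Maven project and add the usual dependencies.

Add the Hadoop dependencies (hadoop-common, hadoop-hdfs and hadoop-client), ideally at the same version as the Hadoop cluster, plus JUnit for unit tests.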

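<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <!-- hadoop.version: a property you define yourself, matching the cluster -->
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <!-- junit.version: a JUnit 4 release, since the tests import org.junit.Test -->
        <version>${junit.version}</version>
    </dependency>
</dependencies>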

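Uploading a file to HDFS

A test class is all we need here, so create a new Java file: main.java

By default, FileSystem resolves to the local file system, so it has to be initialized explicitly as the HDFS file system.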

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.net.URI;

public class main {

    @Test
    public void testPut() throws Exception {
        // There are several ways to obtain a FileSystem; only one is shown here
        // (passing a URI is the most common).
        Configuration configuration = new Configuration();

        // The URI points at the NameNode; 9000 is the RPC port configured on this cluster.
        // "hadoop" is the user account on the Hadoop cluster.
        FileSystem fileSystem = FileSystem.get(
                new URI("hdfs://192.168.147.128:9000"),
                configuration,
                "hadoop");

        // Upload f:/stopword.txt to /user/stopword.txt
        fileSystem.copyFromLocalFile(
                new Path("f:/stopword.txt"), new Path("/user/stopword.txt"));

        fileSystem.close();
    }
}

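The machine-learning stop-word list I just uploaded now shows up in HDFS.

Downloading a file from HDFS

Every test needs an initialized FileSystem, so out of laziness I move that setup into an @Before method that runs before each test. The listing in this section repeats the setup inline so that it can be read on its own.

The HDFS download API is copyToLocalFile; the code is as follows.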

@Test
public void testDownload() throws Exception {
    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.get(
            new URI("hdfs://192.168.147.128:9000"),
            configuration,
            "hadoop");

    fileSystem.copyToLocalFile(
            false,                           // delSrc: keep the source file on HDFS
            new Path("/user/stopword.txt"),  // source path on HDFS
            new Path("stop.txt"),            // destination on the local disk
            true);                           // useRawLocalFileSystem: write no local .crc checksum file

    fileSystem.close();
    System.out.println("over");
}

Python API

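This section covers the hdfs package (HdfsCLI); see https://hdfscli.readthedocs.io/

Install the library with pip install hdfs. Before using it, I ran hadoop fs -chmod -R 777 / to grant read, write and execute permission on the HDFS root directory and everything under it.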
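Opening the whole file system with 777 is only reasonable on a throwaway test cluster. As an alternative, HdfsCLI's InsecureClient takes a user argument and sends requests as that user. The following is a minimal sketch, assuming a cluster with simple (non-Kerberos) authentication; the helper names webhdfs_url and connect are my own and not part of the library, and the port table only encodes the two defaults mentioned in this post.

```python
# Default NameNode web (WebHDFS) ports by Hadoop major version, as noted in this post.
DEFAULT_WEBHDFS_PORT = {2: 50070, 3: 9870}


def webhdfs_url(host, hadoop_major=3):
    """Build the WebHDFS base URL for a NameNode (hypothetical helper)."""
    try:
        port = DEFAULT_WEBHDFS_PORT[hadoop_major]
    except KeyError:
        raise ValueError(f"no default port known for Hadoop {hadoop_major}.x") from None
    return f"http://{host}:{port}"


def connect(host, user, hadoop_major=3):
    """Return an hdfs.InsecureClient whose requests are made as `user`."""
    # Imported lazily so that webhdfs_url works without the hdfs package installed.
    from hdfs import InsecureClient

    return InsecureClient(webhdfs_url(host, hadoop_major), user=user)
```

With the host and account used elsewhere in this post, the call would be connect("192.168.147.128", user="hadoop"), after which the returned client can be used like the Client object in the session.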

>>> from hdfs.client import Client
>>> # WebHDFS port: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x
>>> client = Client('http://192.168.147.128:9870')
>>> client.list('/')  # list the entries under the HDFS root
['hadoop-3.1.4.tar.gz']
>>> client.makedirs('/test')
>>> client.list('/')
['hadoop-3.1.4.tar.gz', 'test']
>>> client.delete("/test")
True
>>> client.download('/hadoop-3.1.4.tar.gz','C:\\Users\\YIUYE\\Desktop')
'C:\\Users\\YIUYE\\Desktop\\hadoop-3.1.4.tar.gz'
>>> client.upload('/','C:\\Users\\YIUYE\\Desktop\\demo.txt')
'/demo.txt'
>>> client.list('/')
['demo.txt', 'hadoop-3.1.4.tar.gz']
>>> # demo.txt was uploaded with the content: Hello \n hdfs
>>> with client.read("/demo.txt") as reader:
...     print(reader.read())
...
b'Hello \r\nhdfs\r\n'
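The upload and read steps of the interactive session can also be folded into a small function for use in scripts. This is a sketch only: round_trip and its parameters are illustrative names, and it assumes a client object that exposes HdfsCLI's upload and read methods, with overwrite forwarded by upload to the underlying write call.

```python
def round_trip(client, local_path, hdfs_dir="/", encoding="utf-8"):
    """Upload a local file into hdfs_dir, then read it back as text.

    Returns a (remote_path, text) tuple.
    """
    # upload() returns the HDFS path it wrote to; overwrite=True lets the
    # function be rerun without failing on an existing remote file.
    remote_path = client.upload(hdfs_dir, local_path, overwrite=True)

    # Passing an encoding makes the reader yield str instead of bytes.
    with client.read(remote_path, encoding=encoding) as reader:
        text = reader.read()

    return remote_path, text
```

Because the function only relies on those two methods, it can be exercised against a stub object in unit tests before it is pointed at a real cluster.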

Compared with the Java API, connecting through the Python API really is simple.


Author: dawei
